ZhangZhihui's Blog  
Does Hive SQL support recursive CTE?
 
ChatGPT said:

Short answer: Apache Hive does not support true recursive CTEs (i.e., WITH RECURSIVE ...) like PostgreSQL, SQL Server, or Oracle.

✔ What Hive supports

Hive supports non-recursive CTEs, for example:

WITH t AS (
  SELECT * FROM my_table WHERE col > 10
)
SELECT * FROM t;

This is just syntactic sugar—no recursion.

✘ What Hive does not support

Hive does not support:

WITH RECURSIVE t AS (
  SELECT ...
  UNION ALL
  SELECT ... FROM t
)

Attempts to run such SQL will give a parse error.

✔ Workarounds

Depending on what you need, common solutions are:

1. Use a series of self-joins (if depth is fixed)

For small, fixed-depth hierarchies, manually unroll the recursion.

2. Use a UDTF / explode approach

For hierarchical paths stored as arrays or nested structures.

3. Use a MapReduce / Spark job instead of SQL

If recursion depth is dynamic, using PySpark or Hive + Tez UDTFs is more practical.

4. Use a loop at the application level

Iteratively run queries and store intermediate results into temp tables:

level 0 → table0
level 1 → table1
level 2 → table2
...
until no new rows

5. Use Spark SQL (supports recursive CTE since Spark 3.4)

If your environment includes Spark, this is often the easiest alternative.


If you tell me your use case (e.g., "query parent-child hierarchy", "traverse tree", "generate date range", etc.), I can help you write a Hive-compatible solution.

 
You said:
Query parent-child hierarchy
 
ChatGPT said:

Hive SQL can query a parent-child hierarchy, but because Hive does not support recursive CTE, you must use one of the following approaches depending on your data and your requirements.

Below are the three most practical solutions with examples.
If you tell me your table schema, I can tailor the SQL exactly.


✅ Scenario 1 — Depth is small and known (e.g., max 5 levels)

You can unroll the recursion manually.

Suppose the table is:

id    parent_id    name

Example (3 levels max):

WITH
lvl1 AS (
  SELECT id, parent_id, name, CAST(name AS string) AS path
  FROM org
  WHERE parent_id IS NULL
),
lvl2 AS (
  SELECT c.id, c.parent_id, c.name,
         CONCAT(p.path, ' > ', c.name) AS path
  FROM org c
  JOIN lvl1 p ON c.parent_id = p.id
),
lvl3 AS (
  SELECT c.id, c.parent_id, c.name,
         CONCAT(p.path, ' > ', c.name) AS path
  FROM org c
  JOIN lvl2 p ON c.parent_id = p.id
)
SELECT * FROM lvl1
UNION ALL
SELECT * FROM lvl2
UNION ALL
SELECT * FROM lvl3;

 


✅ Scenario 2 — Unknown depth (dynamic)

Approach: Iterative self-join with temp tables (loop outside Hive SQL)

You run these statements in a loop (in script or application):

Step 1 — Select root nodes

INSERT OVERWRITE TABLE hierarchy_level0
SELECT id, parent_id, name, CAST(name AS string) AS path
FROM org
WHERE parent_id IS NULL;

Step 2 — Iteratively find children

Loop:

INSERT OVERWRITE TABLE hierarchy_levelN
SELECT c.id, c.parent_id, c.name,
       CONCAT(p.path, ' > ', c.name) AS path
FROM org c
JOIN hierarchy_level(N-1) p
  ON c.parent_id = p.id;

Stop when no new rows are produced.

This simulates recursive CTE.


✅ Scenario 3 — Use Hive UDF/UDTF (works when hierarchy stored as paths)

If the table already has path string or list of ancestors, you can use split(), explode().

Example:

/A/B/C  →  [A, B, C]

Query:

SELECT
  id,
  parent_id,
  name,
  size(split(path_str, '/')) AS depth
FROM org_paths;

 


🔥 MOST COMMON PRACTICAL SOLUTION IN INDUSTRY

If you need true tree traversal, and you run on Hadoop/Yarn, teams usually switch to:

Spark SQL (supports recursive CTE since Spark 3.4+):

WITH RECURSIVE t AS (
  SELECT id, parent_id, name FROM org WHERE parent_id IS NULL
  UNION ALL
  SELECT c.id, c.parent_id, c.name
  FROM org c
  JOIN t p ON c.parent_id = p.id
)
SELECT * FROM t;

This runs directly and is simplest.


If you want, tell me:

  1. Your table structure

  2. Sample data

  3. Desired output

I’ll write fully working Hive SQL for your hierarchy query.

 
You said:
Does Hive SQL have a structure to temporarily store a list of values without creating a table?
 
ChatGPT said:

Short answer: Hive SQL does not have variables or temporary in-memory lists like PostgreSQL arrays or MySQL session variables.
But yes, Hive has structures inside a query that can temporarily hold lists without creating tables.

Here are all the options you can use.


✅ 1. Use array(), map(), struct() — stored only inside the query

These are row-level data structures, not tables.

Example:

SELECT array(1, 2, 3, 4) AS my_list;

Or:

SELECT map('a', 1, 'b', 2);

These are temporary values inside the SQL statement; no table is created.


✅ 2. Use INLINE or VALUES to create a temporary in-query table

Sometimes you just want a list of values as a virtual table:

Option A — VALUES

SELECT * FROM (
  VALUES (1), (2), (3)
) AS t(id);

Option B — inline()

SELECT x
FROM inline(array(
    struct(1),
    struct(2),
    struct(3)
)) AS t(x);

This effectively creates a list in memory only for the query.


✅ 3. Use explode() to expand arrays into rows

SELECT explode(array(10, 20, 30)) AS value;

 

Produces:

value
10
20
30

❌ What Hive does not have

Hive does not support:

  • Temporary variables like SET @x = ...

  • Temporary in-memory structures shared across multiple statements

  • A structure that persists beyond a single SQL statement (unless you create a table)


🧩 Summary

NeedHive Supports?How
Store list inside a query ✔ Yes array(), inline(), VALUES
Create a temporary virtual table ✔ Yes VALUES or inline()
Persist list across queries without a table ❌ No Must create a table
SQL variables ❌ No Hive does not support SQL variables

If you tell me what you want to do with the list (filter a table? join? loop?), I can show you the cleanest Hive solution.

 
You said:
I want to store a column of values selected from a table.
 
ChatGPT said:

In Hive SQL, you cannot store a column of values into a variable (Hive has no arrays/variables outside a single SELECT), but you can keep the list inside your query using ARRAY or VALUES without creating a table.

Below are the 3 practical ways to “store” a column of values from a table without creating any table.


Method 1 — Convert the column into an ARRAY (inside a single query)

If you only need the list inside the current query, use collect_list():

WITH t AS (
  SELECT collect_list(id) AS id_list
  FROM my_table
)
SELECT id_list FROM t;

This gives:

[1, 5, 7, 10, ...]

Now you can use it immediately:

WITH tmp AS (
  SELECT collect_list(id) AS ids
  FROM my_table
)
SELECT *
FROM other_table
WHERE other_table.id IN (SELECT explode(ids) FROM tmp);

 

✔ Works inside one query
✔ No table created
✔ Common solution for “store list temporarily”


Method 2 — Use a subquery as a temporary virtual table

If you want to reuse the list inside the same SQL:

SELECT *
FROM target t
JOIN (
    SELECT id
    FROM my_table
) AS v
ON t.id = v.id;

This creates an in-memory result set only for the duration of the query.


Method 3 — Use VALUES with dynamic SQL (if list is small)

You can generate a VALUES block:

SELECT *
FROM target t
JOIN (
  SELECT id
  FROM (
    VALUES (1), (5), (7), (10)
  ) AS x(id)
) v
ON t.id = v.id;

But this requires generating the VALUES list outside Hive.


❌ What Hive cannot do

Hive does not allow:

  • Storing a list in a session variable

  • Reusing a list across statements without a table

  • Creating temporary non-persistent tables like PostgreSQL TEMP TABLE (Hive temp tables are still HDFS tables)


✅ Recommended Best Practice

If your use case is:

“I want to use a list of values later in the same SQL.”

→ Use collect_list() + explode().

Example:

WITH ids AS (
  SELECT collect_list(user_id) AS id_list
  FROM users
)
SELECT *
FROM log_table
WHERE uid IN (SELECT explode(id_list) FROM ids);

 

posted on 2025-11-17 09:09  ZhangZhihuiAAA  阅读(0)  评论(0)    收藏  举报