https://hive.apache.org/docs/latest/language/languagemanual-windowingandanalytics/
- Windowing functions
- LEAD
- LAG
- FIRST_VALUE
- LAST_VALUE
- The OVER clause
-
OVER with standard aggregates:
- COUNT
- SUM
- MIN
- MAX
- AVG
-
OVER with a PARTITION BY statement with one or more partitioning columns of any primitive datatype.
-
OVER with PARTITION BY and ORDER BY with one or more partitioning and/or ordering columns of any datatype.
-
- Analytics functions
- RANK
- ROW_NUMBER
- DENSE_RANK
- CUME_DIST
- PERCENT_RANK
- NTILE
- Distinct support in Hive 2.1.0 and later (see HIVE-9534)
COUNT(DISTINCT a) OVER (PARTITION BY c)
COUNT(DISTINCT a) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
- Aggregate functions in OVER clause support in Hive 2.1.0 and later (see HIVE-13475)
SELECT rank() OVER (ORDER BY sum(b))
FROM T
GROUP BY a;
In Hive (and standard SQL), windowing functions allow you to perform calculations across a set of rows related to the current row.
To illustrate these, let's assume we have a table called sales_data:
| employee | department | sales |
| Alice | Sales | 5000 |
| Bob | Sales | 5000 |
| Charlie | Sales | 4000 |
| David | Sales | 3000 |
1. ROW_NUMBER, RANK, and DENSE_RANK
These are the most common ranking functions. They behave differently when they encounter "ties" (identical values).
SELECT
employee, sales,
ROW_NUMBER() OVER (ORDER BY sales DESC) as row_num,
RANK() OVER (ORDER BY sales DESC) as rank_val,
DENSE_RANK() OVER (ORDER BY sales DESC) as dense_rank_val
FROM sales_data;
Result:
|employee |sales |row_num |rank_val |dense_rank_val |Note | |---------|------|--------|---------|---------------|----------------------------| |Alice |5000 |1 |1 |1 | | |Bob |5000 |2 |1 |1 |Tied | |Charlie |4000 |3 |3 |2 |Rank skips 2; Dense doesn't | |David |3000 |4 |4 |3 | |
-
ROW_NUMBER: Assigns a unique, sequential number (1, 2, 3, 4).
-
RANK: Ties get the same rank, but the next number is skipped (1, 1, 3).
-
DENSE_RANK: Ties get the same rank, and the next number is the immediate successor (1, 1, 2).
2. CUME_DIST and PERCENT_RANK
These functions are used for statistical distribution.
-
CUME_DIST (Cumulative Distribution): Calculates the relative rank of a value: (rows <= current_row) / total_rows.
-
PERCENT_RANK: Calculates the percentile ranking: (rank - 1) / (total_rows - 1).
SELECT
employee, sales,
CUME_DIST() OVER (ORDER BY sales DESC) as cume_dist_val,
PERCENT_RANK() OVER (ORDER BY sales DESC) as percent_rank_val
FROM sales_data;
Result:
|employee |sales |cume_dist_val |percent_rank_val | |---------|------|--------------|-----------------| |Alice |5000 |0.50 |0.00 | |Bob |5000 |0.50 |0.00 | |Charlie |4000 |0.75 |0.66 | |David |3000 |1.00 |1.00 |
3. NTILE
NTILE(n) divides the result set into n buckets (groups) and assigns a bucket number to each row. This is very useful for finding quartiles or quintiles.
-- Dividing our 4 employees into 2 buckets
SELECT
employee, sales,
NTILE(2) OVER (ORDER BY sales DESC) as bucket
FROM sales_data;
Result:
|employee |sales |bucket | |---------|------|-------| |Alice |5000 |1 | |Bob |5000 |1 | |Charlie |4000 |2 | |David |3000 |2 |
Key Syntax Note: PARTITION BY
In all the examples above, we used OVER (ORDER BY ...). If you want these calculations to restart for every department, you would add PARTITION BY department:
SELECT
employee, department, sales,
RANK() OVER (PARTITION BY department ORDER BY sales DESC) as dept_rank
FROM sales_data;

浙公网安备 33010602011771号