公告

日历

Windowing functions
- LEAD
- LAG
- FIRST_VALUE
- LAST_VALUE
The OVER clause
- OVER with standard aggregates:
  - COUNT
  - SUM
  - MIN
  - MAX
  - AVG
- OVER with a PARTITION BY statement with one or more partitioning columns of any primitive datatype.
- OVER with PARTITION BY and ORDER BY with one or more partitioning and/or ordering columns of any datatype.
Analytics functions
- RANK
- ROW_NUMBER
- DENSE_RANK
- CUME_DIST
- PERCENT_RANK
- NTILE
Distinct support in Hive 2.1.0 and later (see HIVE-9534)

COUNT(DISTINCT a) OVER (PARTITION BY c)

COUNT(DISTINCT a) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)

Aggregate functions in OVER clause support in Hive 2.1.0 and later (see HIVE-13475)

SELECT rank() OVER (ORDER BY sum(b))
FROM T
GROUP BY a;

In Hive (and standard SQL), windowing functions allow you to perform calculations across a set of rows related to the current row.

To illustrate these, let's assume we have a table called sales_data:

employee	department	sales
Alice	Sales	5000
Bob	Sales	5000
Charlie	Sales	4000
David	Sales	3000

1. ROW_NUMBER, RANK, and DENSE_RANK

These are the most common ranking functions. They behave differently when they encounter "ties" (identical values).

SELECT 
    employee, sales,
    ROW_NUMBER() OVER (ORDER BY sales DESC) as row_num,
    RANK()       OVER (ORDER BY sales DESC) as rank_val,
    DENSE_RANK() OVER (ORDER BY sales DESC) as dense_rank_val
FROM sales_data;

Result:

|employee |sales |row_num |rank_val |dense_rank_val |Note                        |
|---------|------|--------|---------|---------------|----------------------------|
|Alice    |5000  |1       |1        |1              |                            |
|Bob      |5000  |2       |1        |1              |Tied                        |
|Charlie  |4000  |3       |3        |2              |Rank skips 2; Dense doesn't |
|David    |3000  |4       |4        |3              |                            |

ROW_NUMBER: Assigns a unique, sequential number (1, 2, 3, 4).
RANK: Ties get the same rank, but the next number is skipped (1, 1, 3).
DENSE_RANK: Ties get the same rank, and the next number is the immediate successor (1, 1, 2).

2. CUME_DIST and PERCENT_RANK

These functions are used for statistical distribution.

CUME_DIST (Cumulative Distribution): Calculates the relative rank of a value: $(rows \le current\_row) / total\_rows$
PERCENT_RANK: Calculates the percentile ranking: $(rank - 1) / (total\_rows - 1)$

SELECT 
    employee, sales,
    CUME_DIST()    OVER (ORDER BY sales DESC) as cume_dist_val,
    PERCENT_RANK() OVER (ORDER BY sales DESC) as percent_rank_val
FROM sales_data;

Result:

|employee |sales |cume_dist_val |percent_rank_val |
|---------|------|--------------|-----------------|
|Alice    |5000  |0.50          |0.00             |
|Bob      |5000  |0.50          |0.00             |
|Charlie  |4000  |0.75          |0.66             |
|David    |3000  |1.00          |1.00             |

3. NTILE

NTILE(n) divides the result set into n buckets (groups) and assigns a bucket number to each row. This is very useful for finding quartiles or quintiles.

-- Dividing our 4 employees into 2 buckets
SELECT 
    employee, sales,
    NTILE(2) OVER (ORDER BY sales DESC) as bucket
FROM sales_data;

Result:

|employee |sales |bucket |
|---------|------|-------|
|Alice    |5000  |1      |
|Bob      |5000  |1      |
|Charlie  |4000  |2      |
|David    |3000  |2      |

Key Syntax Note: PARTITION BY

In all the examples above, we used OVER (ORDER BY ...). If you want these calculations to restart for every department, you would add PARTITION BY department:

SELECT 
    employee, department, sales,
    RANK() OVER (PARTITION BY department ORDER BY sales DESC) as dept_rank
FROM sales_data;

posted on 2026-01-15 10:31 ZhangZhihuiAAA 阅读(3) 评论(0) 收藏举报

刷新页面返回顶部


博客园 © 2004-2026 浙公网安备 33010602011771号浙ICP备2021040463号-3

导航

1. ROW_NUMBER, RANK, and DENSE_RANK

2. CUME_DIST and PERCENT_RANK

3. NTILE

Key Syntax Note: PARTITION BY