ZhangZhihui's Blog  

https://hive.apache.org/docs/latest/language/languagemanual-windowingandanalytics/

 

  1. Windowing functions
    • LEAD
    • LAG
    • FIRST_VALUE
    • LAST_VALUE
  2. The OVER clause
    • OVER with standard aggregates:

      • COUNT
      • SUM
      • MIN
      • MAX
      • AVG
    • OVER with a PARTITION BY statement with one or more partitioning columns of any primitive datatype.

    • OVER with PARTITION BY and ORDER BY with one or more partitioning and/or ordering columns of any datatype.

  3. Analytics functions
    • RANK
    • ROW_NUMBER
    • DENSE_RANK
    • CUME_DIST
    • PERCENT_RANK
    • NTILE
  4. Distinct support in Hive 2.1.0 and later (see HIVE-9534)
COUNT(DISTINCT a) OVER (PARTITION BY c)
COUNT(DISTINCT a) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
  1. Aggregate functions in OVER clause support in Hive 2.1.0 and later (see HIVE-13475)
SELECT rank() OVER (ORDER BY sum(b))
FROM T
GROUP BY a;

 

In Hive (and standard SQL), windowing functions allow you to perform calculations across a set of rows related to the current row.

To illustrate these, let's assume we have a table called sales_data:

employee department sales
Alice Sales 5000
Bob Sales 5000
Charlie Sales 4000
David Sales 3000

1. ROW_NUMBER, RANK, and DENSE_RANK

These are the most common ranking functions. They behave differently when they encounter "ties" (identical values).

SELECT 
    employee, sales,
    ROW_NUMBER() OVER (ORDER BY sales DESC) as row_num,
    RANK()       OVER (ORDER BY sales DESC) as rank_val,
    DENSE_RANK() OVER (ORDER BY sales DESC) as dense_rank_val
FROM sales_data;

Result:

|employee |sales |row_num |rank_val |dense_rank_val |Note                        |
|---------|------|--------|---------|---------------|----------------------------|
|Alice    |5000  |1       |1        |1              |                            |
|Bob      |5000  |2       |1        |1              |Tied                        |
|Charlie  |4000  |3       |3        |2              |Rank skips 2; Dense doesn't |
|David    |3000  |4       |4        |3              |                            |
  • ROW_NUMBER: Assigns a unique, sequential number (1, 2, 3, 4).

  • RANK: Ties get the same rank, but the next number is skipped (1, 1, 3).

  • DENSE_RANK: Ties get the same rank, and the next number is the immediate successor (1, 1, 2).


2. CUME_DIST and PERCENT_RANK

These functions are used for statistical distribution.

  • CUME_DIST (Cumulative Distribution): Calculates the relative rank of a value: (rows <= current_row) / total_rows.

  • PERCENT_RANK: Calculates the percentile ranking: (rank - 1) / (total_rows - 1).

SELECT 
    employee, sales,
    CUME_DIST()    OVER (ORDER BY sales DESC) as cume_dist_val,
    PERCENT_RANK() OVER (ORDER BY sales DESC) as percent_rank_val
FROM sales_data;

Result:

|employee |sales |cume_dist_val |percent_rank_val |
|---------|------|--------------|-----------------|
|Alice    |5000  |0.50          |0.00             |
|Bob      |5000  |0.50          |0.00             |
|Charlie  |4000  |0.75          |0.66             |
|David    |3000  |1.00          |1.00             |

3. NTILE

NTILE(n) divides the result set into n buckets (groups) and assigns a bucket number to each row. This is very useful for finding quartiles or quintiles.

-- Dividing our 4 employees into 2 buckets
SELECT 
    employee, sales,
    NTILE(2) OVER (ORDER BY sales DESC) as bucket
FROM sales_data;

Result:

|employee |sales |bucket |
|---------|------|-------|
|Alice    |5000  |1      |
|Bob      |5000  |1      |
|Charlie  |4000  |2      |
|David    |3000  |2      |

Key Syntax Note: PARTITION BY

In all the examples above, we used OVER (ORDER BY ...). If you want these calculations to restart for every department, you would add PARTITION BY department:

SELECT 
    employee, department, sales,
    RANK() OVER (PARTITION BY department ORDER BY sales DESC) as dept_rank
FROM sales_data;

 

posted on 2026-01-15 10:31  ZhangZhihuiAAA  阅读(3)  评论(0)    收藏  举报