Spark SQL 窗函数的使用

窗口函数

SQL窗口函数是SQL中的一种高级函数,它允许用户在不显式分组查询的情况下对结果集进行分组和聚合计算。

窗口函数的特别之处在于,它们将结果集中的每一行看作一个单独的计算对象,而不是将结果集划分为分组并计算每个分组的聚合值。这

就使得窗口函数能够为结果集中的每一行计算类似排名、行号、百分比和移动聚合函数等值。

和GROUP BY聚合函数的核心区别是:分组不折叠,聚合不丢失明细

  1. 聚合函数 (sum/count/max):将分组内多行数据合并为一行,丢失原行的明细数据;
  2. 窗口函数:对指定分组(窗口)内的数据做聚合 / 排序 / 偏移计算,计算结果作为新列追加到原表每一行,不会减少原表行数、保留所有明细数据。
窗口函数(参数) OVER (
    [PARTITION BY 分组字段1, 分组字段2, ...]  -- 可选,窗口分组条件,类似GROUP BY,无则全局1个窗口
    [ORDER BY 排序字段1 [ASC|DESC], 排序字段2 [ASC|DESC], ...]  -- 可选,窗口内数据排序,部分函数必填
    [ROWS | RANGE BETWEEN 边界条件1 AND 边界条件2]  -- 可选,窗口范围限定(行范围/值范围)
) AS 别名

其中:

  1. <窗口函数>: 定义要在窗口中计算的聚合函数或其它分析函数,如COUNT、RANK、SUM等。

  2. OVER : 窗口函数的核心关键字。

  3. PARTITION BY : 定义要用来分组的一组列名。

  4. ORDER BY : 定义用来排序的一组列名。

  5. <rows or range clause>: 窗口范围边界条件,定义窗口的行集合。默认为 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ,表示窗口包括从窗口开始到当前行的所有行。
    窗口范围默认值:ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW(从分组第一行到当前行),以下是最常用的边界条件:

ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING:分组内所有行(全局窗口)
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING:当前行 + 分组内后续所有行
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING:当前行 + 上 1 行 + 下 1 行(滑动窗口,共 3 行)
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW:当前行 + 前 3 行(近 4 行滑动统计)
ROWS BETWEEN 0 PRECEDING AND 0 FOLLOWING:仅当前行

sql窗口函数分类

① 聚合类:SUM、COUNT、AVG、MAX、MIN、FIRST_VALUE、LAST_VALUE(最常用)

② 排名类:ROW_NUMBER()、RANK()、DENSE_RANK()

③ 偏移类:LAG()、LEAD()、NTH_VALUE()(同比 / 环比)

④ 分布类:CUME_DIST()、PERCENT_RANK()、NTILE(n)

⑤ 窗口范围:求移动平均

聚合类窗口函数

普通的聚合函数,如:SUM()、COUNT()、AVG()、MAX()、MIN()、FIRST_VALUE()、LAST_VALUE()

普通聚合函数直接复用为窗口函数,是开发中使用频率最高的窗口函数;

ORDER BY时,分组内所有行返回相同的聚合值;有ORDER BY时,默认是累计聚合(从分组首行到当前行);

一般用来实现「分组统计、累计求和、滑动均值、分组极值」等功能。

案例1:基础聚合-按员工分组,统计每个员工的总收款、平均收款、收款笔数、最高单笔收款、最低单笔收款

SELECT 
    p.staff_id, s.first_name || ' ' || s.last_name as staff_name, p.payment_id, p.amount, p.payment_date,
    SUM(p.amount) OVER(PARTITION BY p.staff_id) AS staff_total_amount, -- 员工总收款
    AVG(p.amount) OVER(PARTITION BY p.staff_id) AS staff_avg_amount,   -- 员工平均收款
    COUNT(*) OVER(PARTITION BY p.staff_id) AS staff_payment_count,    -- 员工收款笔数
    MAX(p.amount) OVER(PARTITION BY p.staff_id) AS staff_max_amount,   -- 员工最高单笔收款
    MIN(p.amount) OVER(PARTITION BY p.staff_id) AS staff_min_amount    -- 员工最低单笔收款
FROM payment p
LEFT JOIN staff s ON p.staff_id = s.staff_id
LIMIT 10;

案例2:累计聚合-按客户分组,按支付时间升序,统计客户的累计支付金额(核心业务:客户消费流水累计)

SELECT 
    c.customer_id, c.first_name || ' ' || c.last_name as customer_name, p.amount, p.payment_date,
    SUM(p.amount) OVER(PARTITION BY p.customer_id ORDER BY p.payment_date) AS cumulative_pay,
    COUNT(*) OVER(PARTITION BY p.customer_id ORDER BY p.payment_date) AS cumulative_count
FROM payment p
LEFT JOIN customer c ON p.customer_id = c.customer_id
WHERE c.customer_id = 1 -- 筛选单个客户便于查看累计效果
ORDER BY p.payment_date;

案例3:极值取值-按影片分类分组,按租金降序,取分类内最贵/最便宜的影片租金

SELECT 
    f.film_id, f.title, c.name as cate_name, f.rental_rate,
    FIRST_VALUE(f.rental_rate) OVER(PARTITION BY c.name ORDER BY f.rental_rate DESC) AS cate_max_rent,
    LAST_VALUE(f.rental_rate) OVER(PARTITION BY c.name ORDER BY f.rental_rate DESC 
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS cate_min_rent
FROM film f
LEFT JOIN film_category fc ON f.film_id = fc.film_id
LEFT JOIN category c ON fc.category_id = c.category_id;

排名类窗口函数

核心规则 & 语法

  1. 专门用于分组内排序生成排名序号必须搭配 ORDER BY 子句,无 ORDER BY 会直接报错;

    排名函数() OVER(PARTITION BY 分组字段 ORDER BY 排序字段 [ASC|DESC]) AS 排名别名
    
  2. 三者核心差异

    • ROW_NUMBER()连续无重复排名,同分值也会生成连续序号,无并列。例:2 个相同租金,排名为 1、2;
    • RANK()跳跃式并列排名,同分值并列同排名,下一名跳过并列数。例:2 个并列第 1,下一个直接是第 3;
    • DENSE_RANK()连续式并列排名,同分值并列同排名,下一名连续不跳跃。例:2 个并列第 1,下一个是第 2。
-- 案例1:按影片分类分组,按租金降序,生成3种排名(核心:对比三者差异)
SELECT 
    f.title, c.name as cate_name, f.rental_rate,
    ROW_NUMBER() OVER(PARTITION BY c.name ORDER BY f.rental_rate DESC) AS rn,
    RANK() OVER(PARTITION BY c.name ORDER BY f.rental_rate DESC) AS rk,
    DENSE_RANK() OVER(PARTITION BY c.name ORDER BY f.rental_rate DESC) AS drk
FROM film f
LEFT JOIN film_category fc ON f.film_id = fc.film_id
LEFT JOIN category c ON fc.category_id = c.category_id
WHERE c.name IN ('Action','Comedy') -- 筛选2个分类便于查看
LIMIT 20;

-- 案例2:业务高频TOP N - 取每个员工收款金额TOP5的客户(ROW_NUMBER经典用法)
SELECT * FROM (
    SELECT 
        p.staff_id, s.staff_name, p.customer_id, c.customer_name,
        SUM(p.amount) AS cust_total_pay,
        ROW_NUMBER() OVER(PARTITION BY p.staff_id ORDER BY SUM(p.amount) DESC) AS pay_rank
    FROM payment p
    LEFT JOIN (SELECT staff_id, first_name||' '||last_name as staff_name FROM staff) s 
    		ON p.staff_id = s.staff_id
    LEFT JOIN (SELECT customer_id, first_name||' '||last_name as customer_name FROM customer) c 
    		ON p.customer_id = c.customer_id
    GROUP BY p.staff_id, s.staff_name, p.customer_id, c.customer_name
) t WHERE pay_rank <=5; -- 筛选TOP5

偏移类窗口函数

LAG()、LEAD()、NTH_VALUE() 以及FIRST_VALUE()、LAST_VALUE()

分组内按排序规则,获取「当前行的上 N 行、下 N 行、指定第 N 行」的字段值,不改变原表行数,仅新增列。必须搭配 ORDER BY,是实现同比、环比、跨行对比、前后数据溯源的核心函数。

LAG (col, n, default_val) - 向前偏移取值

语法:LAG(字段名, 偏移行数, 无数据默认值) OVER(窗口子句)

作用:取「当前行的上 n 行」的指定字段值,n=1 表示上一行,是最常用的偏移值

无数据时默认返回 NULL,可自定义默认值(如 0)

-- 案例:按客户分组,按支付时间升序,取客户上一笔支付的金额和时间(核心:客户消费流水环比)
SELECT 
    customer_id, payment_date, amount,
    LAG(amount,1,0) OVER(PARTITION BY customer_id ORDER BY payment_date) AS prev_pay_amount,
    LAG(payment_date,1) OVER(PARTITION BY customer_id ORDER BY payment_date) AS prev_pay_date
FROM payment
WHERE customer_id = 2 -- 筛选单个客户便于查看
ORDER BY payment_date;

LEAD (col, n, default_val) - 向后偏移取值

语法:和 LAG 完全一致,参数含义相同

作用:取「当前行的下 n 行」的指定字段值,和 LAG 是反向操作

-- 案例:按影片分类分组,按评分降序,取当前影片的下一部高分影片名称和评分
SELECT 
    c.name as cate_name, f.title, f.rating, f.rental_rate,
    LEAD(f.title,1,'无') OVER(PARTITION BY c.name ORDER BY f.rating DESC) AS next_film,
    LEAD(f.rating,1,0) OVER(PARTITION BY c.name ORDER BY f.rating DESC) AS next_rating
FROM film f
LEFT JOIN film_category fc ON f.film_id = fc.film_id
LEFT JOIN category c ON fc.category_id = c.category_id
WHERE c.name = 'Drama';

NTH_VALUE (col, n) - 指定行取值

语法:NTH_VALUE(字段名, 行号) OVER(窗口子句)

作用:取分组内排序后「第 n 行」的指定字段值,可灵活获取任意位置的数据

-- 案例:按影片分类分组,按租金降序,取分类内租金排名第3的影片租金
SELECT 
    c.name as cate_name, f.title, f.rental_rate,
    NTH_VALUE(f.rental_rate,3) OVER(PARTITION BY c.name ORDER BY f.rental_rate DESC 
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS third_rent
FROM film f
LEFT JOIN film_category fc ON f.film_id = fc.film_id
LEFT JOIN category c ON fc.category_id = c.category_id;

分布 / 序号类窗口函数

CUME_DIST()、PERCENT_RANK()、NTILE(n)、ROW_NUMBER()

用于生成分组内的分布占比、百分比排名、数据分段编号,属于进阶分析函数,「客户价值分层、影片热度分级、收款业绩分布」等场景

CUME_DIST () - 累计分布值

语法:CUME_DIST() OVER(窗口子句)

作用:返回[0,1]区间的小数,表示「当前行的位置 / 分组总行数」,值越大排名越靠后,相同值返回相同分布值

-- 案例:按员工分组,按收款金额降序,统计员工收款金额的累计分布(占比)
SELECT 
    staff_id, amount, payment_date,
    CUME_DIST() OVER(PARTITION BY staff_id ORDER BY amount DESC) AS dist_ratio
FROM payment
LIMIT 15;

PERCENT_RANK () - 百分比排名

语法:PERCENT_RANK() OVER(窗口子句)

作用:返回[0,1]区间的小数,公式:(当前RANK值 - 1) / (分组总行数 - 1),衡量当前行在分组中的相对位置

-- 案例:按影片分类分组,按租金降序,统计影片租金的百分比排名
SELECT 
    c.name as cate_name, f.title, f.rental_rate,
    PERCENT_RANK() OVER(PARTITION BY c.name ORDER BY f.rental_rate DESC) AS percent_rank_val
FROM film f
LEFT JOIN film_category fc ON f.film_id = fc.film_id
LEFT JOIN category c ON fc.category_id = c.category_id;

NTILE (n) - 数据分段 / 分桶

语法:NTILE(分段数) OVER(窗口子句)

作用:将分组内的数据均匀切分为 n 个桶,返回每行所属的桶编号(从 1 开始),数据量不均时,前几个桶多 1 行

核心场景:TOP N%、数据分层、客户分级(如高 / 中 / 低价值客户)

-- 案例:按客户总支付金额降序,将所有客户分为3层(高/中/低价值)
SELECT 
    customer_id, total_pay,
    NTILE(3) OVER(ORDER BY total_pay DESC) AS customer_level -- 1=高价值,2=中价值,3=低价值
FROM (
    SELECT customer_id, SUM(amount) AS total_pay FROM payment GROUP BY customer_id
) t;

窗口范围边界 - 移动平均

时间移动平均是一种常用的数据平滑方法,用于过滤掉数据的短期波动,突出数据的长期趋势。其实就是基于一定的时间窗口内的数据进

行平均处理。例如:5日移动平均、10日移动平均、30日移动平均、3月移动平均等。

举个例子来说明,若时间窗口的大小为 3 个月,那么我们可以按照以下步骤计算3月移动平均。

-- 按日期统计每日收入
SELECT 
    DATE(payment_date) as payment_day,
    SUM(amount) as daily_revenue
FROM payment
GROUP BY DATE(payment_date)
ORDER BY payment_day;

7日移动平均:

WITH daily_revenue AS (
    SELECT 
        DATE(payment_date) as payment_day,
        SUM(amount) as daily_revenue
    FROM payment
    GROUP BY DATE(payment_date)
)
SELECT 
    payment_day,
    daily_revenue,
    ROUND(AVG(daily_revenue) OVER (
        ORDER BY payment_day
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ), 2) as revenue_7day_ma
FROM daily_revenue
ORDER BY payment_day;

每月收入移动平均

-- 按月份统计收入
SELECT 
    DATE_TRUNC('month', payment_date) as payment_month,
    SUM(amount) as monthly_revenue
FROM payment
GROUP BY DATE_TRUNC('month', payment_date)
ORDER BY payment_month;

-- 3个月移动平均:
WITH monthly_revenue AS (
    SELECT 
        DATE_TRUNC('month', payment_date) as payment_month,
        SUM(amount) as monthly_revenue
    FROM payment
    GROUP BY DATE_TRUNC('month', payment_date)
)
SELECT 
    payment_month,
    monthly_revenue,
    ROUND(AVG(monthly_revenue) OVER (
        ORDER BY payment_month
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ), 2) as revenue_3month_ma
FROM monthly_revenue
ORDER BY payment_month;

多维度移动平均

-- 按店铺和月份的移动平均
WITH store_monthly AS (
    SELECT 
        s.store_id,
        DATE_TRUNC('month', p.payment_date) as payment_month,
        SUM(p.amount) as monthly_revenue
    FROM payment p
    JOIN rental r ON p.rental_id = r.rental_id
    JOIN staff s ON p.staff_id = s.staff_id
    GROUP BY s.store_id, DATE_TRUNC('month', p.payment_date)
)
SELECT 
    store_id,
    payment_month,
    monthly_revenue,
    ROUND(AVG(monthly_revenue) OVER (
        PARTITION BY store_id
        ORDER BY payment_month
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ), 2) as store_3month_ma
FROM store_monthly
ORDER BY store_id, payment_month;

中心移动平均

-- 中心移动平均(前后各取1天)
WITH daily_revenue AS (
    SELECT 
        DATE(payment_date) as payment_day,
        SUM(amount) as daily_revenue
    FROM payment
    GROUP BY DATE(payment_date)
)
SELECT 
    payment_day,
    daily_revenue,
    ROUND(AVG(daily_revenue) OVER (
        ORDER BY payment_day
        ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
    ), 2) as revenue_centered_ma
FROM daily_revenue
ORDER BY payment_day;

移动平均与原始数据对比分析

WITH daily_data AS (
    SELECT 
        DATE(payment_date) as payment_day,
        COUNT(*) as daily_transactions,
        SUM(amount) as daily_revenue
    FROM payment
    GROUP BY DATE(payment_date)
)
SELECT 
    payment_day,
    daily_transactions,
    daily_revenue,
    ROUND(AVG(daily_revenue) OVER (
        ORDER BY payment_day
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ), 2) as revenue_7day_ma,
    ROUND(AVG(daily_transactions) OVER (
        ORDER BY payment_day
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ), 1) as transactions_7day_ma,
    -- 计算移动平均与当日值的比率
    ROUND((daily_revenue / NULLIF(AVG(daily_revenue) OVER (
        ORDER BY payment_day
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ), 0) - 1) * 100, 2) as revenue_vs_ma_percent
FROM daily_data
ORDER BY payment_day;

Spark SQL实现

Spark SQL 对窗口函数的 Java API 封装完全贴合原生 SQL 语法,核心映射关系:

  • SQL PARTITION BY col1,col2 → Java Window.partitionBy(col("col1"), col("col2"))
  • SQL ORDER BY col DESC → Java .orderBy(col("col").desc())
  • SQL ROWS BETWEEN A AND B → Java .rowsBetween(Window.xxx(), Window.xxx())
  • 所有窗口函数都在org.apache.spark.sql.functions工具类中,静态调用即可
  • 窗口规则最终构建为WindowSpec对象,传入所有窗口函数的.over()方法

聚合类窗口函数

对应上面「聚合类」SQL 案例,实现员工收款统计 + 客户累计支付

// 窗口规则1:按员工分组
WindowSpec staffWindow = Window.partitionBy(col("staff_id"));
// 员工收款聚合统计
Dataset<Row> staffAggDF = paymentDF.select(
    col("staff_id"), col("payment_id"), col("amount"), col("payment_date"),
    sum(col("amount")).over(staffWindow).alias("staff_total_amount"),
    avg(col("amount")).over(staffWindow).alias("staff_avg_amount"),
    count(col("*")).over(staffWindow).alias("staff_payment_count"),
    max(col("amount")).over(staffWindow).alias("staff_max_amount"),
    min(col("amount")).over(staffWindow).alias("staff_min_amount")
);
staffAggDF.show(10);

// 窗口规则2:按客户分组,支付时间升序
WindowSpec custCumWindow = Window.partitionBy(col("customer_id")).orderBy(col("payment_date"));
// 客户累计支付统计
Dataset<Row> custCumDF = paymentDF.select(
    col("customer_id"), col("amount"), col("payment_date"),
    sum(col("amount")).over(custCumWindow).alias("cumulative_pay"),
    count(col("*")).over(custCumWindow).alias("cumulative_count")
);
custCumDF.filter(col("customer_id").equalTo(1)).show();

排名函数

实现三大排名函数 + 员工收款 TOP5 客户完全一致的业务逻辑

// 窗口规则:按影片分类分组(film表关联后),租金降序
WindowSpec filmRankWindow = Window.partitionBy(col("category_id")).orderBy(col("rental_rate").desc());
// 三大排名函数实现
Dataset<Row> filmRankDF = filmDF.select(
    col("film_id"), col("title"), col("category_id"), col("rental_rate"),
    row_number().over(filmRankWindow).alias("rn"),
    rank().over(filmRankWindow).alias("rk"),
    dense_rank().over(filmRankWindow).alias("drk")
);
filmRankDF.show(20);

// 业务高频:员工收款TOP5客户(子查询+ROW_NUMBER筛选)
Dataset<Row> custPayDF = paymentDF.groupBy(col("staff_id"), col("customer_id"))
        .agg(sum(col("amount")).alias("cust_total_pay"));
WindowSpec top5Window = Window.partitionBy(col("staff_id")).orderBy(col("cust_total_pay").desc());
Dataset<Row> top5CustDF = custPayDF.select(
    col("*"), row_number().over(top5Window).alias("pay_rank")
).filter(col("pay_rank").leq(5)); // 筛选TOP5
top5CustDF.show();

偏移类窗口函数

对应上面「偏移类」SQL 案例,实现LAG 向前取值 + LEAD 向后取值,核心的环比 / 溯源业务:

// 窗口规则:按客户分组,支付时间升序
WindowSpec offsetWindow = Window.partitionBy(col("customer_id")).orderBy(col("payment_date"));
// LAG+LEAD 偏移取值实现
Dataset<Row> offsetDF = paymentDF.select(
    col("customer_id"), col("payment_date"), col("amount"),
    lag(col("amount"),1,0).over(offsetWindow).alias("prev_pay_amount"),
    lag(col("payment_date"),1).over(offsetWindow).alias("prev_pay_date"),
    lead(col("amount"),1,0).over(offsetWindow).alias("next_pay_amount")
);
offsetDF.filter(col("customer_id").equalTo(2)).show();

分布 / 序号类窗口函数

对应上面「分布类」SQL 案例,实现CUME_DIST+PERCENT_RANK+NTILE 客户分层

// 窗口规则:按员工分组,收款金额降序
WindowSpec distWindow = Window.partitionBy(col("staff_id")).orderBy(col("amount").desc());
Dataset<Row> distDF = paymentDF.select(
    col("staff_id"), col("amount"),
    cume_dist().over(distWindow).alias("dist_ratio"),
    percent_rank().over(distWindow).alias("percent_rank_val")
);
distDF.show(15);

// 客户价值分层:按总支付金额降序,分3层
Dataset<Row> custTotalDF = paymentDF.groupBy(col("customer_id")).agg(sum(col("amount")).alias("total_pay"));
WindowSpec ntileWindow = Window.orderBy(col("total_pay").desc());
Dataset<Row> custLevelDF = custTotalDF.select(
    col("*"), ntile(3).over(ntileWindow).alias("customer_level")
);
custLevelDF.show();

范围限定窗口函数

对应 SQL 中的ROWS BETWEEN,实现近 3 笔收款的滑动求和

// 窗口规则:按客户分组,支付时间升序,滑动范围:当前行+前2行(共3行)
WindowSpec slideWindow = Window.partitionBy(col("customer_id"))
        .orderBy(col("payment_date"))
        .rowsBetween(Window.preceding(2), Window.currentRow());
Dataset<Row> slideDF = paymentDF.select(
    col("customer_id"), col("payment_date"), col("amount"),
    sum(col("amount")).over(slideWindow).alias("slide_3_sum")
);
slideDF.show();
  1. LAST_VALUE () 坑点:默认只取「从首行到当前行」的最后一个值,如需取分组内所有行的最后值,必须手动加范围:rowsBetween(Window.unboundedPreceding(), Window.unboundedFollowing())
  2. 排名函数必加 ORDER BY:ROW_NUMBER/RANK/DENSE_RANK 无 ORDER BY 会直接抛异常,这是语法强制约束;
  3. 分区字段选型:尽量选择低基数字段(如 staff_id/category_id),避免大分区导致的性能瓶颈;
  4. 性能优化:窗口函数的分区和排序字段建议做索引,Spark 可指定partitionColumn做分区读取。
posted @ 2026-01-12 22:14  vonlinee  阅读(4)  评论(0)    收藏  举报