Hive Tuning

I. Conclusions

Every window function carries its own order by

Conclusion: replace them with a single global order by

Subqueries: launch MR

Conclusion: avoid subqueries where possible

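Eliminate unneeded data as early as possible

[For Hive] => inner join first (does not launch MR), then where (launches MR)
[For MySQL] => where first (filters rows), then on (put where up front; with several conditions, evaluation runs right to left, so place the one that removes the most rows first)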

Do three inner joins launch three MR jobs?

Conclusion: they do not launch MR, so inner join is fine to use
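
One way to check this conclusion on a given cluster is to read the plan instead of counting jobs by hand. A minimal sketch, assuming the tables from the case study below all live in dw_sales_source: explain prints the stage graph, and with hive.auto.convert.join enabled Hive may convert joins against small dimension tables into map joins, which would be consistent with seeing no extra reduce stages. Whether the conversion happens depends on table sizes and settings, so the plan will differ between environments.

```sql
-- Assumption: the dimension tables are small enough for map-join conversion.
set hive.auto.convert.join=true;

-- Print the execution plan (stages and operators) without running the query.
explain
select
  ddc.customer_name,
  ddp.product_name,
  ddd.d_date,
  dfo.order_amount
from dw_sales_source.dwd_fact_sales_order dfo
inner join dw_sales_source.dwd_dim_date ddd on dfo.date_sk = ddd.date_sk
inner join dw_sales_source.dwd_dim_customer ddc on dfo.customer_sk = ddc.customer_sk
inner join dw_sales_source.dwd_dim_product ddp on dfo.product_sk = ddp.product_sk;
```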

group by does not affect MR

Conclusion: group by does not hurt performance and is fine to use

in is slower than greater-than / less-than, because in requires a full table scan

Conclusion: use a > / < range instead of in
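
The rewrite this conclusion suggests, sketched against the date dimension used in the case study. The two predicates return the same rows only because the range contains exactly the two listed days; that assumes d_date holds one value per calendar day.

```sql
-- Form the conclusion advises against: membership test with in.
select date_sk, d_date
from dw_sales_source.dwd_dim_date
where d_date in ('2018-10-19', '2018-10-20');

-- Suggested form: the same two days expressed as a closed range.
select date_sk, d_date
from dw_sales_source.dwd_dim_date
where d_date >= '2018-10-19' and d_date <= '2018-10-20';
```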

from XXX insert XXX

Hive-specific syntax; preloads the tmp data up front
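
A minimal sketch of this multi-insert form, where the source table is named once and feeds several inserts. The two tmp_ target tables are hypothetical names for illustration and would have to exist already with matching column types.

```sql
-- Hypothetical targets: tmp_order_amount_by_customer, tmp_order_count_by_product.
from dw_sales_source.dwd_fact_sales_order d
-- First insert: total order amount per customer.
insert overwrite table tmp_order_amount_by_customer
  select d.customer_sk, sum(d.order_amount)
  group by d.customer_sk
-- Second insert: order count per product, from the same from clause.
insert overwrite table tmp_order_count_by_product
  select d.product_sk, count(d.order_sk)
  group by d.product_sk;
```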

II. Case Studies

The DM layer in this link: https://www.cnblogs.com/sabertobih/p/13965010.html

>>>

Requirement: for the current day -> customer, product, date, order count, daily amount && for the last two days -> order count, two-day amount

<<<

Original HQL:

select
d_date,customer_sk,product_sk,
`order_num`,
`order_dailyamount`,
sum(`order_dailyamount`) over(rows between 1 PRECEDING and current row) as recent_amount,
sum(`order_num`) over(rows between 1 PRECEDING and current row) as recent_num
from 
(
select 
dss.d_date,
d.customer_sk,
d.product_sk,
count(d.order_sk) as `order_num`,
sum(d.order_amount) as `order_dailyamount`
from 
dw_sales_source.dwd_fact_sales_order d
inner join dw_sales_source.dwd_dim_date dss 
on d.date_sk = dss.date_sk
group by 
dss.d_date,d.customer_sk,d.product_sk
order by dss.d_date
)T

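Improvements:

To drop the subquery: sum(order_dailyamount) over() raises an error, but sum(sum(d.order_amount)) over() works
The window functions repeat the same order by, so move it to the global level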

select 
dss.d_date,d.customer_sk,d.product_sk,
count(d.order_sk) as order_num,
sum(d.order_amount) as order_dailyamount,
sum(sum(d.order_amount)) over(rows between 1 PRECEDING and current row) as recent_amount,
sum(count(d.order_sk)) over(rows between 1 PRECEDING and current row) as recent_num
from 
dw_sales_source.dwd_fact_sales_order d
inner join dw_sales_source.dwd_dim_date dss 
on d.date_sk = dss.date_sk
group by 
dss.d_date,d.customer_sk,d.product_sk
order by dss.d_date
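
The frames above have no partition by, so each running total covers whichever two rows end up adjacent, regardless of customer or product. If the two-day figures are meant per customer and product, one possible variant is sketched below. This is an assumption about the intent, not the original query, and it puts an order by back inside the window so the frame is well defined.

```sql
select
  dss.d_date, d.customer_sk, d.product_sk,
  count(d.order_sk) as order_num,
  sum(d.order_amount) as order_dailyamount,
  -- Current row plus the previous row of the same customer and product.
  sum(sum(d.order_amount)) over (
    partition by d.customer_sk, d.product_sk
    order by dss.d_date
    rows between 1 preceding and current row
  ) as recent_amount,
  sum(count(d.order_sk)) over (
    partition by d.customer_sk, d.product_sk
    order by dss.d_date
    rows between 1 preceding and current row
  ) as recent_num
from dw_sales_source.dwd_fact_sales_order d
inner join dw_sales_source.dwd_dim_date dss
  on d.date_sk = dss.date_sk
group by dss.d_date, d.customer_sk, d.product_sk
order by dss.d_date;
```

Note that rows between 1 preceding counts rows, not calendar days: if a customer and product have no orders on some day, the frame reaches back to the last day that does have a row.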

>>>

Requirement: for 2018-10-20 -> customer, product, date, order count, daily amount && for the last two days -> order count, two-day amount

<<<

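Window function or group by?

It depends on the requirement!

group by => one row per group
window function => continuous, day by day

PS: case when? See the rows-to-columns post: https://www.cnblogs.com/sabertobih/p/13589760.html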

-- group by: the number of orders in each group; this case suits group by
select 
ddc.customer_sk,ddc.customer_number,ddc.customer_name,ddc.customer_street_address,ddc.custom_zip_code,ddc.customer_city,ddc.customer_state,   
ddp.product_sk,ddp.product_code,ddp.product_name,ddp.product_category,
ddd.d_date,ddd.d_month,ddd.d_month_name,ddd.d_quarter,ddd.d_year,
sum(case when datediff('2018-10-20',ddd.d_date)=0 then 1 else 0 end) current_count,
sum(case when datediff('2018-10-20',ddd.d_date)<=1 then 1 else 0 end) two_count,
sum(case when datediff('2018-10-20',ddd.d_date)=0 then dfo.order_amount else 0 end) current_money,
sum(case when datediff('2018-10-20',ddd.d_date)<=1 then dfo.order_amount else 0 end) two_money
from dw_sales_source.dwd_fact_sales_order dfo
inner join dwd_dim_date ddd on dfo.date_sk = ddd.date_sk
inner join dwd_dim_customer ddc on dfo.customer_sk = ddc.customer_sk
inner join dwd_dim_product ddp on dfo.product_sk = ddp.product_sk
where ddd.d_date>='2018-10-19' and ddd.d_date<='2018-10-20'
group by 
ddc.customer_sk,ddc.customer_number,ddc.customer_name,ddc.customer_street_address,ddc.custom_zip_code,ddc.customer_city,ddc.customer_state,   
ddp.product_sk,ddp.product_code,ddp.product_name,ddp.product_category,
ddd.d_date,ddd.d_month,ddd.d_month_name,ddd.d_quarter,ddd.d_year;

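-- With a window function you still have to filter afterwards, which is tedious; but it is very useful for continuous series, e.g. a three-day moving average on a stock chart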
select 
ddc.customer_sk,ddc.customer_number,ddc.customer_name,ddc.customer_street_address,ddc.custom_zip_code,ddc.customer_city,ddc.customer_state,   
ddp.product_sk,ddp.product_code,ddp.product_name,ddp.product_category,
ddd.d_date,ddd.d_month,ddd.d_month_name,ddd.d_quarter,ddd.d_year,
count(dfo.order_sk) over(partition by dfo.customer_sk,dfo.product_sk,dfo.date_sk order by ddd.d_date rows between 1 PRECEDING and current row) 
as recent_num
from dw_sales_source.dwd_fact_sales_order dfo
inner join dwd_dim_date ddd on dfo.date_sk = ddd.date_sk
inner join dwd_dim_customer ddc on dfo.customer_sk = ddc.customer_sk
inner join dwd_dim_product ddp on dfo.product_sk = ddp.product_sk
where 
ddd.d_date>='2018-10-19' and ddd.d_date<='2018-10-20'

posted @ 2020-11-26 13:44 PEAR2020
