优化方案
1、数据模型设计优化
1-1、分区与分桶策略 (高频使用)
分区(Partitioning):
-- 按时间或业务逻辑字段(如日期、地区)对表进行分区,减少扫描数据量。
CREATE TABLE ads_order (
order_id BIGINT,
user_id INT,
amount DECIMAL(10,2),
dt DATE
)
PARTITION BY RANGE(dt) (
PARTITION p202301 VALUES [('2023-01-01'), ('2023-02-01')),
PARTITION p202302 VALUES [('2023-02-01'), ('2023-03-01'))
);
分桶(Bucketing):
-- 选择高基数字段(如 user_id、order_id)作为分桶键,确保数据均匀分布,避免数据倾斜。
DISTRIBUTED BY HASH(user_id) BUCKETS 32;
1-2、数据冗余与宽表设计 (高频使用)
对高频关联的小表(如维度表),通过 Broadcast Join 或 Colocate Join 避免 Shuffle。
若多表关联逻辑固定,可设计宽表(预关联表),减少运行时 Join 开销。
2、查询优化 (高频使用)
2-1、Join 策略选择
Broadcast Join:
-- 适用于小表(如维度表)与大表的关联,将小表广播到所有节点,避免 Shuffle。
SELECT /*+ BROADCAST(dim_user) */
o.order_id,
u.user_name,
o.amount
FROM ads_order o
JOIN dim_user u ON o.user_id = u.user_id;
Shuffle Join:
-- 大表间关联时,按 Join Key 重新分布数据,确保相同 Key 的数据在同一节点。
SELECT /*+ SHUFFLE_JOIN */
o.order_id,
p.product_name,
o.amount
FROM ads_order o
JOIN ads_product p ON o.product_id = p.product_id;
Colocate Join:
-- 若多表的分桶键和分桶数一致,直接本地 Join,无需网络传输。
-- 创建表时指定相同分桶规则
CREATE TABLE ads_order (...) DISTRIBUTED BY HASH(user_id) BUCKETS 32;
CREATE TABLE dim_user (...) DISTRIBUTED BY HASH(user_id) BUCKETS 32;
-- 查询时自动触发 Colocate Join
SELECT o.order_id, u.user_name
FROM ads_order o
JOIN dim_user u ON o.user_id = u.user_id;
2-2、谓词下推与过滤
-- 在 Join 前尽量过滤数据,减少参与关联的数据量:
SELECT o.order_id, u.user_name
FROM (SELECT * FROM ads_order WHERE dt = '2023-01-01') o
JOIN dim_user u ON o.user_id = u.user_id;
2-3、避免笛卡尔积
-- 确保 Join 条件包含有效关联字段,避免全表扫描。
3、物化视图加速 (高频使用)
-- 对高频复杂查询,通过 物化视图(Materialized View) 预计算关联结果:
-- 创建物化视图
CREATE MATERIALIZED VIEW mv_order_user
PARTITION BY dt
DISTRIBUTED BY HASH(order_id)
AS
SELECT o.order_id, u.user_name, o.amount, o.dt
FROM ads_order o
JOIN dim_user u ON o.user_id = u.user_id;
-- 查询时自动命中物化视图
SELECT user_name, SUM(amount)
FROM mv_order_user
WHERE dt = '2023-01-01'
GROUP BY user_name;
4、执行计划调优
分析执行计划
使用 EXPLAIN 命令查看查询计划,重点关注:
数据分布:是否触发 Broadcast/Shuffle/Colocate Join。
谓词下推:过滤条件是否提前应用。
资源消耗:是否存在内存溢出(如 EXCEEDED_MEMORY_LIMIT)。
调整并行度
通过 set parallel_fragment_exec_instance_num = 8; 增加并行度,提升计算资源利用率。
调整 BE(Backend)节点的资源配置,避免资源竞争。
5、统计信息收集
定期收集表统计信息,帮助优化器生成高效执行计划:
-- 手动收集统计信息
ANALYZE TABLE ads_order COMPUTE STATISTICS;
6、资源隔离与优先级
为 ADS 层查询分配独立资源组(Resource Group),避免与其他任务竞争资源:
CREATE RESOURCE GROUP ads_group
TO ('user_ads')
WITH (
'cpu_core_limit' = '16',
'mem_limit' = '80%'
);
7、冷热数据分层
对历史冷数据使用 冷存储策略(如对象存储),降低存储成本。
对热数据保留在本地 SSD,提升查询性能。
Explain(逻辑执行计划) 操作案例:
对一下三个语句进行执行计划分析对比,查询优化
------------------------------------------
--
EXPLAIN
select custId,commOwner,commSchool,commTime,commStatus,commprojIntention from (
select a.custId ,a.commOwner ,a.commSchool ,a.commTime ,a.commStatus, b.commprojIntention, ROW_NUMBER() over(PARTITION by b.custId order by b.commTime) as rnk
from datawarehouse.dwd_ct_CtComm a
join datawarehouse.dwd_ct_CtCommproj b on a.commId = b.commId and a.commDelstatus = 'N' and a.isPhysicsDel = 2 and b.commprojDelstatus = 'N' and b.isPhysicsDel = 2
) t where rnk = 1 and custId <> ''
-----------------------------------------
--
EXPLAIN
select c.custId, c.commOwner, c.commSchool, c.commTime, c.commStatus,a.commprojIntention
from ( select commId ,min(commprojIntention) commprojIntention from datawarehouse.dwd_ct_CtCommproj where isPhysicsDel = 2 group by commId ) a
join (
select a.commId ,a.custId, a.commOwner, a.commSchool, a.commTime, a.commStatus
from datawarehouse.dwd_ct_CtComm a
where commId in ( select min(commId) as commId from datawarehouse.dwd_ct_CtComm where commDelstatus = 'N' AND isPhysicsDel = 2 group by custId )
) c on a.commId =c.commId
-----------------------------------------
--
EXPLAIN
select a.custId, a.commOwner, a.commSchool, a.commTime, a.commStatus, c.commprojIntention
from datawarehouse.dwd_ct_CtComm a
join (select min(commId) commId from datawarehouse.dwd_ct_CtComm where commDelstatus = 'N' AND isPhysicsDel = 2 group by custId) b on a.commId =b.commId
join (select commId,min(commprojIntention) commprojIntention from datawarehouse.dwd_ct_CtCommproj where isPhysicsDel = 2 group by commId ) c on b.commId =c.commId
执行计划:
-----------------------------------------
--
PLAN FRAGMENT 0
OUTPUT EXPRS:8: custId | 7: commOwner | 19: commSchool | 9: commTime | 10: commStatus | 36: commprojIntention
PARTITION: UNPARTITIONED
RESULT SINK
13:EXCHANGE
PLAN FRAGMENT 1 -- 任务碎片
OUTPUT EXPRS: -- 输出表达式
PARTITION: HASH_PARTITIONED: 33: custId -- 分区
STREAM DATA SINK
EXCHANGE ID: 13
UNPARTITIONED -- 未分区
12:Project
| <slot 7> : 7: commOwner
| <slot 8> : 8: custId
| <slot 9> : 9: commTime
| <slot 10> : 10: commStatus
| <slot 19> : 19: commSchool
| <slot 36> : 36: commprojIntention
|
11:SELECT
| predicates: 44: row_number() = 1, 8: custId != ''
|
10:ANALYTIC
| functions: [, row_number(), ]
| partition by: 33: custId
| order by: 31: commTime ASC
| window: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
|
9:SORT
| order by: <slot 33> 33: custId ASC, <slot 31> 31: commTime ASC
| offset: 0
|
8:EXCHANGE
PLAN FRAGMENT 2
OUTPUT EXPRS:
PARTITION: RANDOM
STREAM DATA SINK
EXCHANGE ID: 08
HASH_PARTITIONED: 33: custId
7:PARTITION-TOP-N
| partition by: 33: custId
| partition limit: 1
| order by: <slot 33> 33: custId ASC, <slot 31> 31: commTime ASC
| offset: 0
|
6:Project
| <slot 7> : 7: commOwner
| <slot 8> : 8: custId
| <slot 9> : 9: commTime
| <slot 10> : 10: commStatus
| <slot 19> : 19: commSchool
| <slot 31> : 31: commTime
| <slot 33> : 33: custId
| <slot 36> : 36: commprojIntention
|
5:HASH JOIN
| join op: INNER JOIN (BUCKET_SHUFFLE)
| colocate: false, reason:
| equal join conjunct: 24: commId = 1: commId
|
|----4:EXCHANGE
|
1:Project
| <slot 24> : 24: commId
| <slot 31> : 31: commTime
| <slot 33> : 33: custId
| <slot 36> : 36: commprojIntention
|
0:OlapScanNode
TABLE: dwd_ct_CtCommproj
PREAGGREGATION: OFF. Reason: None aggregate function
PREDICATES: 38: commprojDelstatus = 'N', 42: isPhysicsDel = 2
partitions=1/1
rollup: dwd_ct_CtCommproj
tabletRatio=60/60
tabletList=76203,76207,76211,76215,76219,76223,76227,76231,76235,76239 ...
cardinality=150433003
avgRowSize=19.56558
numNodes=0
PLAN FRAGMENT 3
OUTPUT EXPRS:
PARTITION: RANDOM
STREAM DATA SINK
EXCHANGE ID: 04
BUCKET_SHUFFLE_HASH_PARTITIONED: 1: commId
3:Project
| <slot 1> : 1: commId
| <slot 7> : 7: commOwner
| <slot 8> : 8: custId
| <slot 9> : 9: commTime
| <slot 10> : 10: commStatus
| <slot 19> : 19: commSchool
|
2:OlapScanNode
TABLE: dwd_ct_CtComm -- 表名
PREAGGREGATION: OFF. Reason: Has can not pre-aggregation Join
PREDICATES: 18: commDelstatus = 'N', 22: isPhysicsDel = 2
partitions=1/1
rollup: dwd_ct_CtComm
tabletRatio=60/60
tabletList=75066,75070,75074,75078,75082,75086,75090,75094,75098,75102 ...
cardinality=103274862 -- 扫描表的数据总行数
avgRowSize=28.0 -- 扫描数据行的平均大小
numNodes=0
-- 另外两个省略
分析图如下:
![alt text]()
![alt text]()
![alt text]()