Clickhouse join优化之分桶优化
1、背景
ck在单表查询能够做到极致,但是在join上性能就相对尬尴,
A JOIN B 特别是当两张表的数据都不小的时候,经常就会有内存溢出,超时等等情况
特别是当AB都为分布表的时候
就拿常用的事件表(events_all)和用户表(users_all)做JOIN为例,都是分布表;
以下为clickhouse分桶测试(针对账号pid进行hash分桶,按月分区,相同的pid会落到同一个分片上,增加join效率,减少网络IO)
2、表结构创建
事件本地表 CREATE TABLE test.join_event_local on cluster cluster_3shards_2replicas
(
`ds` String,
`gn` String,
`log_type` String,
`server_id` Int32,
`pid` Int32,
`plosgn` Int8,
`role_uid` Int64,
`log_time` Int32,
`gold_num` Int32,
`gold_type` Int32,
`function_key` Int32,
`direction` Int8,
`item_id` Int32,
`item_num` Int64,
`uuid` String,
`level` Int32,
`vip` Int32,
`remain` Int32,
`salt` Int32,
`plosgn_lo` Int8,
`kv` String,
`state` String,
`flume2hdfs` String,
`updatetime` String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/join_event_local', '{replica}')
PARTITION BY toYYYYMM(toDate(ds))
ORDER BY (gn, server_id, log_time, item_id, uuid, plosgn,intHash64(pid))
SAMPLE BY intHash64(pid)
SETTINGS index_granularity = 8192;
事件分布式表
CREATE TABLE test.join_event_all on cluster cluster_3shards_2replicas
as test.join_event_local
ENGINE = Distributed('cluster_3shards_2replicas', 'test', 'join_event_local',intHash64(pid)) ;
用户本地表
CREATE TABLE test.join_user_local on cluster cluster_3shards_2replicas
(
`ds` String,
`gn` String,
`pid` Int32,
`qid` String,
`active_time` Int64,
`login_ip` Int64,
`source` String,
`bind` String,
`os` String,
`create_time` Int64,
`last_login_time` Int64,
`isvalid` Int32,
`groupid` Int32,
`status` Int8,
`uuid_hash_low` String,
`uuid_hash_high` String,
`plosgn` Int32,
`groupbase` Int32,
`state` String,
`timezone` String,
`a` String,
`b` String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/join_user_local', '{replica}')
PARTITION BY toYYYYMM(toDate(ds))
ORDER BY (gn, pid, intHash64(pid), active_time, login_ip, source, bind, os, create_time, last_login_time, isvalid, groupid, status, uuid_hash_low, uuid_hash_high, plosgn, groupbase)
SAMPLE BY intHash64(pid)
SETTINGS index_granularity = 8192 ;
用户分布式表
CREATE TABLE test.join_user_all on cluster cluster_3shards_2replicas
as test.join_user_local
ENGINE = Distributed('cluster_3shards_2replicas', 'test', 'join_user_local', intHash64(pid))
查询结果
***************非分桶join查询1*******************
测试sql1:账号关联,日期关联计数(256秒)
select count(a.pid) from (select pid,ds from test.event_all) a
join (select pid,ds from test.user_all) b on a.pid=b.pid and a.ds=b.ds group by ds order by ds limit 10;
***************分桶join查询1*******************
测试sql1:账号关联,日期关联计数 (分桶表常规写法查询,依然很慢,309秒)
select ds,count(a.pid) from (select pid,ds from test.join_event_all) a
join (select pid,ds from test.join_user_all) b on a.pid=b.pid and a.ds=b.ds group by ds order by ds limit 10;
(分桶表关联查询ck写法,62秒)
set distributed_product_mode = 'global';
select a.ds,count(a.pid) from test.join_event_all a
global join test.join_user_all b using(pid,ds) group by ds order by ds limit 10;
***************非分桶join查询2*******************
测试sql2:大表关联小表 账号关联:
select count(a.pid) from (select pid from test.event_all) a join (select pid from test.user_all) b on a.pid=b.pid; (非分桶 30秒)
***************分桶join查询2**********************
select count(a.pid) from test.join_event_all a
global join test.join_user_all b using(pid) limit 10; (分桶查询46秒)
注意:查询的时候不要B表括号起来select *from (select form A) a join (select from B) b ...这样 否则优化无效,这跟ck的解析有关
总结:当大表和大表进行关联的时候,效率提升明显(选定好常用的字段进行hash分桶,比如账号。。)(大小表关联条件单一的时候,可能效果不是太明显)
参考:https://blog.csdn.net/a495679822/article/details/118548564
posted on 2022-09-09 11:22 RICH-ATONE 阅读(1450) 评论(0) 收藏 举报
浙公网安备 33010602011771号