Comparing Two Approaches to Generating CSV from a Join Table

Previous post: https://www.cnblogs.com/heyang78/p/15860075.html

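【Requirements】

The target system has three tables: the user table Customer, with up to two million rows; the tag table tag, initially planned at 10000 tags and later reduced to 1600; and their join table Customer_Tag, which holds the most data, up to 2,000,000*1600=3,200,000,000 rows. Each user's tags now need to be output as CSV: 1 if the user has the tag, 0 if not, with values separated by commas.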

For example, with 10 tags in total, if user A has the tags with tagid=1,3,5,7,9, the output should be:

A,1,0,1,0,1,0,1,0,1,0

Other rows follow the same pattern.
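The encoding above can be sketched in Python as follows. This is only an illustration, not the code from the linked posts: the function name `encode_row` and its parameters are hypothetical, and it assumes tag ids are integers starting at 1 and arriving as a comma-separated string.

```python
def encode_row(name: str, tags: str, tag_count: int) -> str:
    """Build one CSV line: the user name followed by a 0/1 flag per tag id."""
    # Index 0..tag_count-1 stands for tag ids 1..tag_count (assumption: ids start at 1).
    flags = ["0"] * tag_count
    for tid in tags.split(","):
        if tid.strip():                # skip empty pieces, e.g. a user with no tags
            flags[int(tid) - 1] = "1"  # mark the tag as present
    return ",".join([name] + flags)


print(encode_row("A", "1,3,5,7,9", 10))  # A,1,0,1,0,1,0,1,0,1,0
```

Each tag id is handled with a single indexed write, so the cost per user grows with the number of tags the user actually has rather than with the total number of tags.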

【Implementation comparison】

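	Original approach	New approach
Idea	Read the join table, collect all tagids per user id, and concatenate them with listagg. In Python, split the tagid string and rebuild it in 0,1,1,0 form.	Read all three tables and join them in sequence, so that SQL directly produces rows of the form customername,0,1,0,1,0...
Work done in SQL	Read the join table, collect all tagids per user id, and concatenate them with listagg.	Most of the business logic.
SQL	select a.name,b.tags from (select id,name from customer where {fromId}<id and id<={toId}) a inner join (select cid,listagg(tid,',') within group(order by 1) as tags from customer_tag group by cid) b on a.id=b.cid	select ct.name,f.tags as line from ( select e.cid,listagg(e.tg,',') within group (order by e.sn) as tags from ( select c.cid,c.sn, decode(nvl(d.tid,0),0,'0','1') as tg from (select a.sn,a.val,b.id as cid from (select level as sn,'0' as val from dual connect by level<=1000) a, (select id from customer where 10000<id and id<=20000) b ) c left join (select * from customer_tag where 10000<cid and cid<=20000) d on c.cid= d.cid and c.sn=d.tid ) e group by e.cid ) f left join customer ct on f.cid=ct.id order by f.cid
Work done in Python	Create a sequence of zeros for tag ids 1-1000, split the tagid string, and change 0 to 1 at each matching index. This amounts to setting entries in an array and is fast.	Only joins the user name with the tag string.
Elapsed time	82 seconds	162 seconds
Advantages	Lower total elapsed time; no sorting at all is needed when concatenating tagids.	No risk of exceeding the listagg limit in either the test environment (1000 tags) or the production environment (1600 tags), since every entry is a single character and the string is at most 1600+1599=3199 characters long.
Drawbacks	The concatenated tagid string may exceed the 4000 limit on the listagg result.	Higher total elapsed time; sorting the tags of every customer is the bottleneck.
Source code	https://www.cnblogs.com/pyhy/p/15855992.html	https://www.cnblogs.com/pyhy/p/15862430.html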
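To see when the original approach runs into the 4000 limit, the worst-case length of the concatenated tagid string can be estimated. The sketch below is not from the linked source code; it assumes tag ids are the consecutive integers 1..N, that a customer holds every tag, and that each character takes one byte. The name `listagg_length` is hypothetical.

```python
def listagg_length(tag_count: int) -> int:
    """Length of the string '1,2,...,tag_count' (digits plus separating commas)."""
    digits = sum(len(str(tid)) for tid in range(1, tag_count + 1))
    commas = max(tag_count - 1, 0)
    return digits + commas


LISTAGG_LIMIT = 4000  # the upper limit cited in the comparison above

for n in (1000, 1600):
    size = listagg_length(n)
    verdict = "fits" if size <= LISTAGG_LIMIT else "exceeds the limit"
    print(n, size, verdict)
# 1000 3892 fits
# 1600 6892 exceeds the limit
```

Under these assumptions, 1000 tags stay just below the limit, while a customer holding all 1600 tags would exceed it. A customer with fewer tags produces a shorter string, so whether the limit is actually reached depends on the data.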

The tests above were run on a T440p with Oracle 11g.

posted @ 2022-02-04 09:55 逆火狂飙
