Mysql 统计标签出现次数（一行变多行）

需求背景

需求

一张数据表
其中有sid字段，代表tag，每行数据可能有多个tag字段
统计全量数据中所有tag出现的次数（按tag分组，分别有多少数据）

source table demo

id	sid
1	a3,a4,a1,a2,a5
2	a5,a3,a4,a2
3	a5,a3,a4

target table demo

sid_tag	sid_occurrence
a1	1
a2	2
a3	3
a4	3
a5	3

SQL query in mysql

select substring_index( substring_index( sid , ',', id_table.help_topic_id + 1 ), ',',- 1 ) as sid_tag,count(sid) as sid_occurrence from src_t JOIN mysql.help_topic id_table on id_table.help_topic_id < (
  length(src_t.sid) - length(replace(src_t.sid, ',', '')) + 1
) GROUP BY sid_tag

hive

select tag,count(*) from(
select tag from table_name
lateral view explode(split(taglist,',')) r1 AS tag) a group by a.tag;

解析

利用系统表mysql.help_topic进行了index的记录（可以使用其他表），更暴力的解法是直接新建一张只有id列的表，可以预防系统表id不够用（一般足够了）
JOIN a.id < b.id，解决了复制行的操作；若b.id = 5，a.id 为（1，2，3，4，5），返回数据为 [(1,5),(2,5),(3,5),(4,5)]
substring_index( substring_index( sid , ',', id_table.help_topic_id + 1 ), ',',- 1 ) 利用index表中的index，作为子串索引进行数据拆分。索引构建基于上一步的join。
随后正常groupby统计

实验

mysql

select * from (
  SELECT concat('a',topic.help_topic_id) as sid,topic.help_topic_id AS id
  FROM mysql.help_topic topic
  WHERE help_topic_id IN ('1', '2', '3', '4', '5')
) as id_table JOIN (
  SELECT concat('b',topic.help_topic_id) as sid,topic.help_topic_id AS id
  FROM mysql.help_topic topic
  WHERE help_topic_id IN ('1', '2', '3')
) as id_table2 on id_table2.id <= id_table.id

结果

sid	id	sid	id
a1	1	b1	1
a2	2	b1	1
a2	2	b2	2
a3	3	b1	1
a3	3	b2	2
a3	3	b3	3
a4	4	b1	1
a4	4	b2	2
a4	4	b3	3
a5	5	b1	1
a5	5	b2	2
a5	5	b3	3

posted @ 2020-06-24 15:29 苏轶然阅读(1527) 评论(0) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部

苏轶然

抱你于怀，寄你于心。跃然屏上，付我青春韶华，换你真情相待。

Mysql 统计标签出现次数（一行变多行）

需求背景

需求

source table demo

target table demo

SQL query in mysql

hive

解析

实验

mysql

结果

公告

sid	id	sid	id
a1	1	b1	1
a2	2	b1	1
a2	2	b2	2
a3	3	b1	1
a3	3	b2	2
a3	3	b3	3
a4	4	b1	1
a4	4	b2	2
a4	4	b3	3
a5	5	b1	1
a5	5	b2	2
a5	5	b3	3

sid	id	sid	id
a1	1	b1	1
a2	2	b1	1
a2	2	b2	2
a3	3	b1	1
a3	3	b2	2
a3	3	b3	3
a4	4	b1	1
a4	4	b2	2
a4	4	b3	3
a5	5	b1	1
a5	5	b2	2
a5	5	b3	3

sid	id	sid	id
a1	1	b1	1
a2	2	b1	1
a2	2	b2	2
a3	3	b1	1
a3	3	b2	2
a3	3	b3	3
a4	4	b1	1
a4	4	b2	2
a4	4	b3	3
a5	5	b1	1
a5	5	b2	2
a5	5	b3	3