词距与关键词之间的词频词距统计操作

需求 : tid :帖子的id
sentence num:句子在帖子中的编号
wordindex:分好的每个词 在帖子中的位置
channel :这条帖子的关键词
要求 计算word 在每一句中 与channel的词距 ,若词距绝对值小于三 则 word 和channel 标签为关联 若关联数 /出现的总数>=2/3; 则此word和channel组成新词,tip为1:
代码实现:
1 DROP TABLE IF EXISTS temp_1; 2 CREATE TEMP TABLE temp_1 AS 3 SELECT 4 tid, 5 sentence_num, 6 wordindex, 7 channel 8 FROM "tmp10" 9 WHERE 10 word = channel 11 LIMIT 100 12 ; 13 14 SELECT 15 t1.*, 16 t2.* 17 FROM temp_1 t1 18 LEFT JOIN "tmp10" t2 19 on t2.tid= t1.tid 20 AND t2.sentence_num = t1.sentence_num 21 22 23 DROP TABLE IF EXISTS tmp_2; 24 CREATE TEMP table tmp_2 as 25 SELECT 26 t1.channel, 27 t1.wordindex as channel_index, 28 t2.word, 29 t2.wordindex, 30 CASE 31 WHEN (t1.wordindex::int - t2.wordindex::int) BETWEEN -3 AND 3 THEN 1 32 ELSE 0 33 END AS dis 34 FROM temp_1 t1 35 LEFT JOIN "tmp10" t2 36 on t2.tid= t1.tid 37 AND t2.sentence_num = t1.sentence_num 38 ; 39 40 SELECT 41 channel, 42 word, 43 (COUNT(*) FILTER(WHERE dis = 1))::FLOAT/COUNT(*) 44 FROM tmp_2 45 GROUP BY 46 channel, 47 word
 
                    
                     
                    
                 
                    
                 
                
            
         
         浙公网安备 33010602011771号
浙公网安备 33010602011771号