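Hive: Using MAPJOIN to Join a Large Table with a Small Table
Today I was extracting data for CTR. The impression table is very large, and the job crashed outright when I joined it with a small table. Switching to a map join solved the problem, so I am noting it here for future reference: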
---- Total number of nodes: 50
---- section_data_tmp_dresult is the large table, 23.3 G in total
---- section_data_tmp_presult is the small table, 184.5 M
hive -e "set hive.mapred.mode=nonstrict;insert overwrite table datamining.section_data_tmp_dp
select /*+ MAPJOIN(section_data_tmp_presult) */ dt.userid,dt.infoid,dt.bookid,pt.regtype,pt.firstinittime,......,dt.prob from datamining.section_data_tmp_dresult dt join datamining.section_data_tmp_presult pt on(dt.userid=pt.username);"
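As an aside, Hive can also choose a map join on its own instead of relying on the hint. The following is only a sketch under assumptions, not what was run above: it assumes a Hive version that supports hive.auto.convert.join, the threshold value of 200000000 bytes is an illustrative number picked to exceed the 184.5 M small table (the documented default is about 25 MB), and the shortened column list is just for the plan check. Depending on the Hive version, hive.auto.convert.join.noconditionaltask.size may also affect the decision, and the small table's hash table still has to fit in each mapper's memory.

```sql
-- Assumption: a Hive version with automatic map-join conversion.
SET hive.auto.convert.join=true;

-- Size threshold in bytes below which a table is treated as "small".
-- 200000000 is an illustrative value, chosen only because it is larger
-- than the 184.5 M table in this note.
SET hive.mapjoin.smalltable.filesize=200000000;

-- Inspect the plan before running the real job: a map-side join
-- appears in the output as a "Map Join Operator".
EXPLAIN
SELECT dt.userid, pt.regtype
FROM datamining.section_data_tmp_dresult dt
JOIN datamining.section_data_tmp_presult pt
  ON (dt.userid = pt.username);
```

If the plan shows an ordinary reduce-side join instead, the small table was probably still over the configured threshold.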
PS: the command for checking file sizes on the cluster has the following form:
[huchao@ab-cli03 ~]$ hadoop fs -du -s -h /warehouse/datamining.db/section_data_tmp_presult
184.5 M /warehouse/datamining.db/section_data_tmp_presult
[huchao@ab-cli03 ~]$ hadoop fs -du -s -h /warehouse/datamining.db/section_data_tmp_dresult
23.3 G /warehouse/datamining.db/section_data_tmp_dresult