Hive: joining a large table with a small table using the mapjoin technique

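Today I was extracting data for CTR. The impression table is very large, and when I joined it with a much smaller table the job simply crashed. Using a mapjoin solved the problem: the small table is loaded into memory on every mapper, so the large table does not have to be shuffled across the cluster. I am noting it down here for future reference: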

---- Total number of nodes: 50
---- section_data_tmp_dresult is the large table, 23.3 G in total
---- section_data_tmp_presult is the small table, 184.5 M


hive -e "set hive.mapred.mode=nonstrict;
insert overwrite table datamining.section_data_tmp_dp
select /*+ MAPJOIN(pt) */ dt.userid,dt.infoid,dt.bookid,pt.regtype,pt.firstinittime,......,dt.prob
from datamining.section_data_tmp_dresult dt
join datamining.section_data_tmp_presult pt on (dt.userid=pt.username);"
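
As an alternative to the hint, Hive can also convert a join into a map join on its own. The following is only a minimal sketch, assuming a Hive version that supports hive.auto.convert.join. The threshold of 200000000 bytes is an illustrative value chosen only because it exceeds the 184.5 M small table; it is not a setting from the original job. The column list is shortened for illustration.

```sql
-- Let Hive decide by itself whether a join can run as a map join.
set hive.auto.convert.join=true;
-- Size threshold in bytes below which a table counts as "small"
-- (the default is 25000000). 200000000 is an assumed value, picked
-- only so that the 184.5 M table qualifies.
set hive.mapjoin.smalltable.filesize=200000000;

-- Same join as above without the MAPJOIN hint; columns shortened for illustration.
select dt.userid, dt.infoid, dt.bookid, pt.regtype, pt.firstinittime, dt.prob
from datamining.section_data_tmp_dresult dt
join datamining.section_data_tmp_presult pt
  on (dt.userid = pt.username);
```

Either way, the small table has to fit into the memory of each map task. If it does not, raising the threshold alone will not help and the job may still fail.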


PS: the command for checking file sizes on the cluster has the following format:
[huchao@ab-cli03 ~]$ hadoop fs -du -s -h /warehouse/datamining.db/section_data_tmp_dresult
23.3 G /warehouse/datamining.db/section_data_tmp_dresult
[huchao@ab-cli03 ~]$ hadoop fs -du -s -h /warehouse/datamining.db/section_data_tmp_presult
184.5 M /warehouse/datamining.db/section_data_tmp_presult

posted on 2014-05-21 09:15 by 云梦山庄
