Comparing Row Storage and Column Storage in Hive Using TPC-H Data

This post compares row-oriented storage with column-oriented storage (RCFile) in Hive.

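1. TPC-H

The source code is available at http://www.tpc.org/tpch/ ; the version I downloaded is 2.14.0.

After downloading the source, edit the makefile to match your system. For example, I changed it as follows: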

CC =gcc

DATABASE= DB2

MACHINE = LINUX

WORKLOAD = TPCH

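By default TPC-H writes each record as col1|col2|col3| (with a trailing delimiter), but some databases expect col1|col2|col3 as input. To produce that format, modify dss.h in the TPC-H source: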

/*#define PR_END(fp) fprintf(fp, "\n")*/ /* finish the record here */

#define PR_END(fp) {fseek(fp,-1,SEEK_CUR); fprintf(fp,"\n");}

Then run make to build the dbgen executable.

Run ./dbgen -h to see the command-line help.
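As an alternative to patching dss.h, the trailing delimiter can also be removed by post-processing the default dbgen output. The following is only a minimal sketch in Python; the function names strip_trailing_delimiter and convert_file are hypothetical, and it assumes the files are plain text with one record per line.

```python
def strip_trailing_delimiter(line, delimiter="|"):
    """Remove a single trailing delimiter from one record, keeping its line ending."""
    body = line.rstrip("\r\n")
    line_ending = line[len(body):]
    if body.endswith(delimiter):
        body = body[:-len(delimiter)]
    return body + line_ending


def convert_file(src_path, dst_path, delimiter="|"):
    """Copy src_path to dst_path, dropping the trailing delimiter of every record."""
    # newline="" keeps the original line endings untouched on read and write.
    with open(src_path, "r", newline="") as src, open(dst_path, "w", newline="") as dst:
        for line in src:
            dst.write(strip_trailing_delimiter(line, delimiter))
```

For example, convert_file("lineitem.tbl", "lineitem.clean.tbl") would write a copy whose records end with the last column value instead of "|" (the output file name here is made up for illustration).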

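2. Hive

Next, the Hive configuration. I want to compare row-stored and column-stored data in Hive, and I want both to be compressed with ZLIB, so the Hadoop and Hive configurations need to be changed.

The Hadoop configuration file to change is mapred-site.xml: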

<!-- mapred-site.xml -->
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
  <description>If the job outputs are to be compressed as SequenceFiles, how should
  they be compressed? Should be one of NONE, RECORD or BLOCK.
  Cloudera's Distribution for Hadoop switches this default to BLOCK
  for better performance.
  </description>
</property>

<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>

Hive needs the following changes as well:

<!-- hive-default.xml -->
<property>
  <name>hive.exec.compress.output</name>
  <value>true</value>
  <description> This controls whether the final outputs of a query (to a local/hdfs file or a hive table) is compressed. The compression codec and other options are determined from hadoop config variables mapred.output.compress* </description>
</property>

<property>
  <name>hive.exec.compress.intermediate</name>
  <value>true</value>
  <description> This controls whether intermediate files produced by hive between multiple map-reduce jobs are compressed. The compression codec and other options are determined from hadoop config variables mapred.output.compress* </description>
</property>
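Before restarting anything, it can help to confirm that the edited file really contains the intended values. The sketch below is one possible check in Python, assuming the properties sit under a single <configuration> root element, as in the standard Hadoop and Hive configuration files. The helper names read_properties and mismatched_settings and the SAMPLE document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Settings this post expects to find in the Hive configuration.
EXPECTED = {
    "hive.exec.compress.output": "true",
    "hive.exec.compress.intermediate": "true",
}

# Stand-in for the contents of the real configuration file.
SAMPLE = """
<configuration>
  <property>
    <name>hive.exec.compress.output</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.exec.compress.intermediate</name>
    <value>true</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return {name: value} for every <property> element in the document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def mismatched_settings(props, expected):
    """Return the expected keys whose actual value is missing or different."""
    return {key: props.get(key) for key, want in expected.items() if props.get(key) != want}


if __name__ == "__main__":
    print(mismatched_settings(read_properties(SAMPLE), EXPECTED))
```

An empty dictionary means every expected setting is present with the expected value. To check a real file, read its text and pass it to read_properties in place of SAMPLE.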

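With the configuration done, restart Hadoop, start Hive, and begin the actual tests.

3. Tests

Disk space used after loading the 100 GB TPC-H data set:

SQL used for the test: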

SELECT lineitem.returnflag, lineitem.linestatus, SUM(lineitem.extendedprice * (1 - lineitem.discount)), AVG(lineitem.discount)
FROM lineitem
WHERE lineitem.shipdate <= '1998-11-28'
AND lineitem.orderkey > 1000
GROUP BY lineitem.returnflag, lineitem.linestatus;
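To make clear what this query computes, here is a small sketch that runs the same aggregation on a handful of made-up rows. It uses Python's built-in sqlite3 module rather than Hive, so it only illustrates the query logic, not Hive's behavior or performance. The rows are invented for illustration, the column is spelled discount, and the ORDER BY is added only to make the output order predictable.

```python
import sqlite3

QUERY = """
SELECT returnflag, linestatus,
       SUM(extendedprice * (1 - discount)) AS revenue,
       AVG(discount) AS avg_discount
FROM lineitem
WHERE shipdate <= '1998-11-28'
  AND orderkey > 1000
GROUP BY returnflag, linestatus
ORDER BY returnflag, linestatus
"""

# Invented sample rows: (orderkey, returnflag, linestatus, extendedprice, discount, shipdate)
SAMPLE_ROWS = [
    (1001, "A", "F", 100.0, 0.10, "1995-01-01"),
    (1002, "A", "F", 200.0, 0.20, "1996-06-15"),
    (1003, "N", "O", 300.0, 0.00, "1998-11-28"),
    (500,  "N", "O", 400.0, 0.05, "1997-03-03"),  # dropped: orderkey is not > 1000
    (1004, "R", "F", 500.0, 0.05, "1998-12-01"),  # dropped: shipdate is after the cutoff
]


def run_demo(rows=SAMPLE_ROWS):
    """Load the rows into an in-memory table and return the aggregated result."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE lineitem (orderkey INTEGER, returnflag TEXT, linestatus TEXT, "
        "extendedprice REAL, discount REAL, shipdate TEXT)"
    )
    conn.executemany("INSERT INTO lineitem VALUES (?, ?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in run_demo():
        print(row)
```

Note that the query touches only six columns of lineitem (returnflag, linestatus, extendedprice, discount, shipdate, orderkey), which matters for the storage comparison.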

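Test results

RCFile queries run faster than row-store queries mainly because of the Map phase: each Map task reads less data and therefore incurs less I/O, so the Map phase finishes sooner.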
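The reasoning above can be illustrated with a rough back-of-the-envelope model. This is a deliberately simplified sketch, not a description of RCFile's actual on-disk format: it ignores compression, row-group metadata and block boundaries, and the per-column byte widths are hypothetical values chosen only for illustration. The function name bytes_scanned is likewise made up.

```python
# Hypothetical average size in bytes of one value in each lineitem column.
COLUMN_WIDTHS = {
    "orderkey": 8, "partkey": 8, "suppkey": 8, "linenumber": 4,
    "quantity": 8, "extendedprice": 8, "discount": 8, "tax": 8,
    "returnflag": 1, "linestatus": 1, "shipdate": 10, "commitdate": 10,
    "receiptdate": 10, "shipinstruct": 25, "shipmode": 10, "comment": 44,
}

# Columns referenced by the test query.
QUERY_COLUMNS = ["returnflag", "linestatus", "extendedprice", "discount", "shipdate", "orderkey"]


def bytes_scanned(widths, needed, num_rows, columnar):
    """Estimate the bytes a scan must read under this simplified model.

    A row layout has to read every column of every row, while a columnar
    layout can skip the columns the query does not reference.
    """
    if columnar:
        per_row = sum(widths[name] for name in needed)
    else:
        per_row = sum(widths.values())
    return per_row * num_rows


if __name__ == "__main__":
    rows = 1000
    print("row layout:     ", bytes_scanned(COLUMN_WIDTHS, QUERY_COLUMNS, rows, columnar=False))
    print("columnar layout:", bytes_scanned(COLUMN_WIDTHS, QUERY_COLUMNS, rows, columnar=True))
```

Under these assumed widths the columnar scan reads 36 of every 171 bytes per row. The real ratio depends on the actual data and on how well each layout compresses.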

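4. Summary

As the tests above show, RCFile has a substantial advantage: it uses less disk space than row storage in Hive, the open-source data warehouse system, without degrading query performance.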

Posted on 2011-08-15 09:54 by Shall
