TeraSort Test
Platform
Hadoop 0.20.205.0 on Red Hat Enterprise Linux Server 6.0.
Experiment Procedure
Generating the Data
hadoop jar hadoop-examples-0.20.205.0.jar teragen 1000000 tera1000000-input
Note: the command can also be written as
hadoop jar hadoop-examples-0.20.205.0.jar teragen -Dmapred.map.tasks=16 1000000 tera1000000-input
to set the number of map tasks manually.
Another useful parameter: hadoop jar ... -Dmapred.reduce.slowstart.completed.maps=0.95 ...
The point at which reducers start relative to the mappers is controlled by mapred.reduce.slowstart.completed.maps, a number between 0 and 1. At 0, the reducers start as soon as the mappers start; at 0.75, the reducers start once 75% of the mappers have finished. In stock Hadoop it is set to 0.2, while MapR sets it to 0.95.
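As a complete illustration (a sketch only, reusing the jar and directory names from this experiment; the document's own -D examples suggest the option is passed the same way):
hadoop jar hadoop-examples-0.20.205.0.jar terasort -Dmapred.reduce.slowstart.completed.maps=0.95 tera1000000-input tera-out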
Similarly,
hadoop jar examples.jar terasort -Dmapred.reduce.tasks=30 /teragen_input_dir /teragen_output_dir
sets the number of reduce tasks.
This generates 1,000,000 rows of data, each 100 bytes, i.e. 100 MB in total. By default teragen uses 2 map tasks, each producing half of the data. The run looks like this:
Warning: $HADOOP_HOME is deprecated.
Generating 1000000 using 2 maps with step of 500000
12/03/04 17:18:17 INFO mapred.JobClient: Running job: job_201202281948_0035
12/03/04 17:18:18 INFO mapred.JobClient: map 0% reduce 0%
12/03/04 17:18:31 INFO mapred.JobClient: map 50% reduce 0%
12/03/04 17:18:38 INFO mapred.JobClient: map 100% reduce 0%
12/03/04 17:18:43 INFO mapred.JobClient: Job complete: job_201202281948_0035
12/03/04 17:18:43 INFO mapred.JobClient: Counters: 19
12/03/04 17:18:43 INFO mapred.JobClient: Job Counters
12/03/04 17:18:43 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=22516
12/03/04 17:18:43 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/03/04 17:18:43 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/03/04 17:18:43 INFO mapred.JobClient: Launched map tasks=2
12/03/04 17:18:43 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
12/03/04 17:18:43 INFO mapred.JobClient: File Input Format Counters
12/03/04 17:18:43 INFO mapred.JobClient: Bytes Read=0
12/03/04 17:18:43 INFO mapred.JobClient: File Output Format Counters
12/03/04 17:18:43 INFO mapred.JobClient: Bytes Written=100000000
12/03/04 17:18:43 INFO mapred.JobClient: FileSystemCounters
12/03/04 17:18:43 INFO mapred.JobClient: HDFS_BYTES_READ=167
12/03/04 17:18:43 INFO mapred.JobClient: FILE_BYTES_WRITTEN=42910
12/03/04 17:18:43 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=100000000
12/03/04 17:18:43 INFO mapred.JobClient: Map-Reduce Framework
12/03/04 17:18:43 INFO mapred.JobClient: Map input records=1000000
12/03/04 17:18:43 INFO mapred.JobClient: Physical memory (bytes) snapshot=244236288
12/03/04 17:18:43 INFO mapred.JobClient: Spilled Records=0
12/03/04 17:18:43 INFO mapred.JobClient: CPU time spent (ms)=5980
12/03/04 17:18:43 INFO mapred.JobClient: Total committed heap usage (bytes)=174325760
12/03/04 17:18:43 INFO mapred.JobClient: Virtual memory (bytes) snapshot=13373423616
12/03/04 17:18:43 INFO mapred.JobClient: Map input bytes=1000000
12/03/04 17:18:43 INFO mapred.JobClient: Map output records=1000000
12/03/04 17:18:43 INFO mapred.JobClient: SPLIT_RAW_BYTES=167
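As a sanity check, the HDFS_BYTES_WRITTEN counter above matches the expected output size:
echo $(( 1000000 * 100 ))   # 1,000,000 rows x 100 bytes = 100000000 bytes (100 MB)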
In the generated data, the leading part of each row is the key and the rest is the value; the sort job's task is to order the rows by key.
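To peek at a few of the generated rows (a sketch; part-00000 is the conventional name of the first teragen output file):
hadoop fs -cat tera1000000-input/part-00000 | head -n 3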
Running the Sort
Since we generated 100 MB of data and the default HDFS block size is 64 MB, the sort job runs with 2 map tasks.
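The input layout can be checked beforehand (teragen wrote one part file per map task, so two files of roughly 50 MB each are expected here):
hadoop fs -ls tera1000000-input
Then launch the sort: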
hadoop jar hadoop-examples-0.20.205.0.jar terasort tera1000000-input tera-out
The run proceeds as follows:
12/03/04 17:30:15 INFO terasort.TeraSort: starting
12/03/04 17:30:16 INFO mapred.FileInputFormat: Total input paths to process : 2
12/03/04 17:30:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/03/04 17:30:16 INFO compress.CodecPool: Got brand-new compressor
Making 1 from 100000 records
Step size is 100000.0
12/03/04 17:30:17 INFO mapred.FileInputFormat: Total input paths to process : 2
12/03/04 17:30:17 INFO mapred.JobClient: Running job: job_201202281948_0036
12/03/04 17:30:18 INFO mapred.JobClient: map 0% reduce 0%
12/03/04 17:30:34 INFO mapred.JobClient: map 100% reduce 0%
12/03/04 17:30:46 INFO mapred.JobClient: map 100% reduce 16%
12/03/04 17:30:52 INFO mapred.JobClient: map 100% reduce 100%
12/03/04 17:30:57 INFO mapred.JobClient: Job complete: job_201202281948_0036
12/03/04 17:30:57 INFO mapred.JobClient: Counters: 30
12/03/04 17:30:57 INFO mapred.JobClient: Job Counters
12/03/04 17:30:57 INFO mapred.JobClient: Launched reduce tasks=1
12/03/04 17:30:57 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=26218
12/03/04 17:30:57 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/03/04 17:30:57 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/03/04 17:30:57 INFO mapred.JobClient: Rack-local map tasks=2
12/03/04 17:30:57 INFO mapred.JobClient: Launched map tasks=2
12/03/04 17:30:57 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=16450
12/03/04 17:30:57 INFO mapred.JobClient: File Input Format Counters
12/03/04 17:30:57 INFO mapred.JobClient: Bytes Read=100000000
12/03/04 17:30:57 INFO mapred.JobClient: File Output Format Counters
12/03/04 17:30:57 INFO mapred.JobClient: Bytes Written=100000000
12/03/04 17:30:57 INFO mapred.JobClient: FileSystemCounters
12/03/04 17:30:57 INFO mapred.JobClient: FILE_BYTES_READ=204000288
12/03/04 17:30:57 INFO mapred.JobClient: HDFS_BYTES_READ=100000230
12/03/04 17:30:57 INFO mapred.JobClient: FILE_BYTES_WRITTEN=306067170
12/03/04 17:30:57 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=100000000
12/03/04 17:30:57 INFO mapred.JobClient: Map-Reduce Framework
12/03/04 17:30:57 INFO mapred.JobClient: Map output materialized bytes=102000012
12/03/04 17:30:57 INFO mapred.JobClient: Map input records=1000000
12/03/04 17:30:57 INFO mapred.JobClient: Reduce shuffle bytes=51000006
12/03/04 17:30:57 INFO mapred.JobClient: Spilled Records=3000000
12/03/04 17:30:57 INFO mapred.JobClient: Map output bytes=100000000
12/03/04 17:30:57 INFO mapred.JobClient: Total committed heap usage (bytes)=759627776
12/03/04 17:30:57 INFO mapred.JobClient: CPU time spent (ms)=26240
12/03/04 17:30:57 INFO mapred.JobClient: Map input bytes=100000000
12/03/04 17:30:57 INFO mapred.JobClient: SPLIT_RAW_BYTES=230
12/03/04 17:30:57 INFO mapred.JobClient: Combine input records=0
12/03/04 17:30:57 INFO mapred.JobClient: Reduce input records=1000000
12/03/04 17:30:57 INFO mapred.JobClient: Reduce input groups=1000000
12/03/04 17:30:57 INFO mapred.JobClient: Combine output records=0
12/03/04 17:30:57 INFO mapred.JobClient: Physical memory (bytes) snapshot=750379008
12/03/04 17:30:57 INFO mapred.JobClient: Reduce output records=1000000
12/03/04 17:30:57 INFO mapred.JobClient: Virtual memory (bytes) snapshot=20332793856
12/03/04 17:30:57 INFO mapred.JobClient: Map output records=1000000
12/03/04 17:30:57 INFO terasort.TeraSort: done
Validation
hadoop jar hadoop-examples-0.20.205.0.jar teravalidate tera-out/part-00000 out
The resulting file has size 0, i.e. no out-of-order keys were found.
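Listing the validation output confirms this (a sketch; the output directory name follows the command above):
hadoop fs -ls out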
Timing Comparison
CPU time (ms) for 1 GB of data:
|          | 2 maps | 4 maps | 8 maps | 16 maps |
| Teragen  | 4460   | 5610   | 8340   | 13410   |
| Terasort | 19640  | 17420  | 21000  | 29260   |
CPU time (ms) for 10 GB of data:
|          | 2 maps | 4 maps | 8 maps | 16 maps | 64 maps |
| Teragen  | 27010  | 31130  | 37770  | 42460   | 74920   |
| Terasort | 148450 | 150960 | 148490 | 147470  | 169530  |
I later realized this is wrong: the argument to teragen is not the file size in bytes but the number of rows! So the tables above should be disregarded.
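Since each row is 100 bytes, the row count for a given target size is simply size_in_bytes / 100. A quick shell check (assuming decimal GB):
echo $(( 1000000000 / 100 ))    # 10000000 rows for 1 GB
echo $(( 10000000000 / 100 ))   # 100000000 rows for 10 GB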
Analysis
Hadoop ships a script, job_history_summary.py, for summarizing job statistics. I modified it slightly to read from a file and emit plain-text output; in essence, it pulls the relevant figures out of the log files by regular-expression matching.
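The same extraction can be sketched with standard shell tools, assuming the JobClient console output shown above was saved to a file (job.log is a hypothetical name):
# sum every "CPU time spent (ms)=<n>" counter found in the saved console log
grep -o 'CPU time spent (ms)=[0-9]*' job.log | awk -F= '{sum += $2} END {print sum}'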