TeraSort Benchmark

Platform

Based on Hadoop 0.20.205.0, running on Red Hat Enterprise Linux Server 6.0.

Experiment Procedure

Generating the Data

hadoop jar hadoop-examples-0.20.205.0.jar teragen 1000000 tera1000000-input

Note: this can also be written as

hadoop jar hadoop-examples-0.20.205.0.jar teragen -Dmapred.map.tasks=16 1000000 tera1000000-input

which sets the number of map tasks manually.

 

Another parameter worth adding: hadoop jar ... -Dmapred.reduce.slowstart.completed.maps=0.95 ...

When reducers start relative to mappers is controlled by mapred.reduce.slowstart.completed.maps, a number between 0 and 1. At 0, the reducers start when the mappers start. At 0.75, reducers start when 75% of the mappers have finished. In stock Hadoop it is set to 0.2, while MapR sets it to 0.95.
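For example, to delay reducer startup until 95% of the maps have completed, the option can be passed to the terasort driver like any other -D option (a sketch reusing the input/output paths from this experiment):

hadoop jar hadoop-examples-0.20.205.0.jar terasort -Dmapred.reduce.slowstart.completed.maps=0.95 tera1000000-input tera-out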

 

hadoop jar examples.jar terasort -Dmapred.reduce.tasks=30 /teragen_input_dir /teragen_output_dir

This is used to set the number of reduce tasks.

This generated 1,000,000 rows of data, each row 100 bytes, i.e. 100 MB in total. By default two map tasks are used (each generating half of the data), as shown below:

Warning: $HADOOP_HOME is deprecated.

 

Generating 1000000 using 2 maps with step of 500000

12/03/04 17:18:17 INFO mapred.JobClient: Running job: job_201202281948_0035

12/03/04 17:18:18 INFO mapred.JobClient:  map 0% reduce 0%

12/03/04 17:18:31 INFO mapred.JobClient:  map 50% reduce 0%

12/03/04 17:18:38 INFO mapred.JobClient:  map 100% reduce 0%

12/03/04 17:18:43 INFO mapred.JobClient: Job complete: job_201202281948_0035

12/03/04 17:18:43 INFO mapred.JobClient: Counters: 19

12/03/04 17:18:43 INFO mapred.JobClient:   Job Counters

12/03/04 17:18:43 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=22516

12/03/04 17:18:43 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0

12/03/04 17:18:43 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0

12/03/04 17:18:43 INFO mapred.JobClient:     Launched map tasks=2

12/03/04 17:18:43 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0

12/03/04 17:18:43 INFO mapred.JobClient:   File Input Format Counters

12/03/04 17:18:43 INFO mapred.JobClient:     Bytes Read=0

12/03/04 17:18:43 INFO mapred.JobClient:   File Output Format Counters

12/03/04 17:18:43 INFO mapred.JobClient:     Bytes Written=100000000

12/03/04 17:18:43 INFO mapred.JobClient:   FileSystemCounters

12/03/04 17:18:43 INFO mapred.JobClient:     HDFS_BYTES_READ=167

12/03/04 17:18:43 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=42910

12/03/04 17:18:43 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=100000000

12/03/04 17:18:43 INFO mapred.JobClient:   Map-Reduce Framework

12/03/04 17:18:43 INFO mapred.JobClient:     Map input records=1000000

12/03/04 17:18:43 INFO mapred.JobClient:     Physical memory (bytes) snapshot=244236288

12/03/04 17:18:43 INFO mapred.JobClient:     Spilled Records=0

12/03/04 17:18:43 INFO mapred.JobClient:     CPU time spent (ms)=5980

12/03/04 17:18:43 INFO mapred.JobClient:     Total committed heap usage (bytes)=174325760

12/03/04 17:18:43 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=13373423616

12/03/04 17:18:43 INFO mapred.JobClient:     Map input bytes=1000000

12/03/04 17:18:43 INFO mapred.JobClient:     Map output records=1000000

12/03/04 17:18:43 INFO mapred.JobClient:     SPLIT_RAW_BYTES=167

 

A sample of the generated data:

 

The first part of each record is the key and the rest is the value; the sort job's task is to order the records by key.
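To peek at a few records directly (a sketch; the path matches the teragen command above, and in this Hadoop version teragen emits plain-text rows, each a 10-character key followed by a row id and filler characters):

hadoop fs -cat tera1000000-input/part-00000 | head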

Running the Sort

Since we generated 100 MB of data earlier and the default HDFS block size is 64 MB, the sort job spawns 2 map tasks.

hadoop jar hadoop-examples-0.20.205.0.jar terasort tera1000000-input tera-out

The execution log:

12/03/04 17:30:15 INFO terasort.TeraSort: starting

12/03/04 17:30:16 INFO mapred.FileInputFormat: Total input paths to process : 2

12/03/04 17:30:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

12/03/04 17:30:16 INFO compress.CodecPool: Got brand-new compressor

Making 1 from 100000 records

Step size is 100000.0

12/03/04 17:30:17 INFO mapred.FileInputFormat: Total input paths to process : 2

12/03/04 17:30:17 INFO mapred.JobClient: Running job: job_201202281948_0036

12/03/04 17:30:18 INFO mapred.JobClient:  map 0% reduce 0%

12/03/04 17:30:34 INFO mapred.JobClient:  map 100% reduce 0%

12/03/04 17:30:46 INFO mapred.JobClient:  map 100% reduce 16%

12/03/04 17:30:52 INFO mapred.JobClient:  map 100% reduce 100%

12/03/04 17:30:57 INFO mapred.JobClient: Job complete: job_201202281948_0036

12/03/04 17:30:57 INFO mapred.JobClient: Counters: 30

12/03/04 17:30:57 INFO mapred.JobClient:   Job Counters

12/03/04 17:30:57 INFO mapred.JobClient:     Launched reduce tasks=1

12/03/04 17:30:57 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=26218

12/03/04 17:30:57 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0

12/03/04 17:30:57 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0

12/03/04 17:30:57 INFO mapred.JobClient:     Rack-local map tasks=2

12/03/04 17:30:57 INFO mapred.JobClient:     Launched map tasks=2

12/03/04 17:30:57 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=16450

12/03/04 17:30:57 INFO mapred.JobClient:   File Input Format Counters

12/03/04 17:30:57 INFO mapred.JobClient:     Bytes Read=100000000

12/03/04 17:30:57 INFO mapred.JobClient:   File Output Format Counters

12/03/04 17:30:57 INFO mapred.JobClient:     Bytes Written=100000000

12/03/04 17:30:57 INFO mapred.JobClient:   FileSystemCounters

12/03/04 17:30:57 INFO mapred.JobClient:     FILE_BYTES_READ=204000288

12/03/04 17:30:57 INFO mapred.JobClient:     HDFS_BYTES_READ=100000230

12/03/04 17:30:57 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=306067170

12/03/04 17:30:57 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=100000000

12/03/04 17:30:57 INFO mapred.JobClient:   Map-Reduce Framework

12/03/04 17:30:57 INFO mapred.JobClient:     Map output materialized bytes=102000012

12/03/04 17:30:57 INFO mapred.JobClient:     Map input records=1000000

12/03/04 17:30:57 INFO mapred.JobClient:     Reduce shuffle bytes=51000006

12/03/04 17:30:57 INFO mapred.JobClient:     Spilled Records=3000000

12/03/04 17:30:57 INFO mapred.JobClient:     Map output bytes=100000000

12/03/04 17:30:57 INFO mapred.JobClient:     Total committed heap usage (bytes)=759627776

12/03/04 17:30:57 INFO mapred.JobClient:     CPU time spent (ms)=26240

12/03/04 17:30:57 INFO mapred.JobClient:     Map input bytes=100000000

12/03/04 17:30:57 INFO mapred.JobClient:     SPLIT_RAW_BYTES=230

12/03/04 17:30:57 INFO mapred.JobClient:     Combine input records=0

12/03/04 17:30:57 INFO mapred.JobClient:     Reduce input records=1000000

12/03/04 17:30:57 INFO mapred.JobClient:     Reduce input groups=1000000

12/03/04 17:30:57 INFO mapred.JobClient:     Combine output records=0

12/03/04 17:30:57 INFO mapred.JobClient:     Physical memory (bytes) snapshot=750379008

12/03/04 17:30:57 INFO mapred.JobClient:     Reduce output records=1000000

12/03/04 17:30:57 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=20332793856

12/03/04 17:30:57 INFO mapred.JobClient:     Map output records=1000000

12/03/04 17:30:57 INFO terasort.TeraSort: done

 

Validation

hadoop jar hadoop-examples-0.20.205.0.jar teravalidate tera-out/part-00000 out

The validation output file has size 0, meaning no keys were out of order.
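To confirm, list the validation output directory (name as in the command above); teravalidate writes a record for every ordering violation it finds, so an empty part file means the sort output is correct:

hadoop fs -ls out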

 

Timing Comparison

 

CPU time (ms) for processing 1 GB of data:

            2 maps    4 maps    8 maps   16 maps
Teragen       4460      5610      8340     13410
Terasort     19640     17420     21000     29260

 

 

 

CPU time (ms) for processing 10 GB of data:

            2 maps    4 maps    8 maps   16 maps   64 maps
Teragen      27010     31130     37770     42460     74920
Terasort    148450    150960    148490    147470    169530

Update: I later realized this was a mistake on my part. The argument to teragen is not the file size but the number of rows! So the tables above should be disregarded.
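For reference, since each row is 100 bytes, a true 1 GB (10^9 bytes) would require 10,000,000 rows, e.g. (the output path tera1GB-input is just an illustrative name):

hadoop jar hadoop-examples-0.20.205.0.jar teragen 10000000 tera1GB-input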

Summary and Analysis

Hadoop actually ships with a script for this kind of statistics, job_history_summary.py. I modified it slightly so that it reads from a file and prints plain text; essentially it pulls the relevant values out of the log file by regular-expression matching.
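As a rough sketch of the same idea, the counters can be pulled out of a saved job log with a regular expression (job.log is a hypothetical file holding console output like the runs above):

grep -oE '(CPU time spent \(ms\)|SLOTS_MILLIS_MAPS|SLOTS_MILLIS_REDUCES)=[0-9]+' job.log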

posted @ 2012-03-05 19:46 editice