sqoop安装配置教程,以及从Oracle数据抽取数据的分析

sqoop安装教程

1.使用背景以及介绍

  Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from arelational database management system (RDBMS) such as MySQL or Oracleinto the Hadoop Distributed File System (HDFS),transform the data in Hadoop MapReduce, and then export the data backinto an RDBMS

 

     介绍:上文引用自官方文档。意思很明显,sqoop就是将传统关系型数据库中的数据抽取到sqoop上,也能导回RDBMS。而且sqoop现在已经是apache的一个顶级项目,应该比较成熟了。

 

     具体使用和详细信息参考官方文档:http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html

2.下载解压

   因为我使用的hadoop是0.20.X的版本,sqoop不支持。sqoop支持cdh3的版本,不过没有关系,通过拷贝jar的方式可以弥补。

wget http://archive.cloudera.com/cdh/3/hadoop-0.20.2-CDH3B4.tar.gz

wget http://archive.cloudera.com/cdh/3/sqoop-1.2.0-CDH3B4.tar.gz

tar -zxvf sqoop-1.2.0-CDH3B4.tar.gz
tar -zxvf hadoop-0.20.2-CDH3B4.tar.gz

cp hadoop-0.20.2-CDH3B4/hadoop-core-0.20.2-CDH3B4.jar  sqoop-1.2.0-CDH3B4/lib/

 

 

3.配置环境变量

 

vi .bash_profile

#内容如下
export SQOOP_HOME=/home/hadoop/sqoop-1.2.0-CDH3B4

export PATH=$PATH:$SQOOP_HOME/bin export CLASSPATH
=$CLASSPATH:$SQOOP_HOME/ojdbc6.jar:$SQOOP_HOME/lib/*

话说我一开始老是类找不到 可是东西明明在lib下了,非常的无奈,还好后来自己摸索出来这样强制指定jar的方式

 

4.尝试从Oracle数据库抽取300多万条记录

sqoop import --connect jdbc:oracle:thin:@192.168.1.65/wkm --username wkm --password wkm --target-dir /user/hadoop/wkmData2  --table C_PICRECORD_ALL  -m 1


用以上命令执行即可

--connect 后给出连接的数据库地址以及名字

--username  用户名

--password  密码

--targer-dir  hdfs上要保存的输出位置,目录自己会创建,不用我们操心

--table   用的ORACLE 一定要表名大写 因为sqoop是从oracle dictionary table里面读取的 

-m 1  设定1个并行度 ,如果多个并行度 必须确保有主键

 

 

输出结果:

注意:如果你的hadoop采用了mesos,不要把CPU和内存资源分配完,要不然会卡在map0%  reduce0%

 

------------------------------------------------1cpu 1G内存----------------------------------------------------------------------------

13/04/16 11:31:05 INFO mapreduce.ImportJobBase: Beginning import of C_PICRECORD_ALL
13/04/16 11:31:05 INFO manager.OracleManager: Time zone has been set to GMT
13/04/16 11:31:08 INFO mapred.JobClient: Running job: job_201304161006_0004
13/04/16 11:31:09 INFO mapred.JobClient:  map 0% reduce 0%
13/04/16 11:31:34 INFO mapred.JobClient:  map 100% reduce 0%
13/04/16 11:40:42 INFO mapred.JobClient: Job complete: job_201304161006_0004
13/04/16 11:40:42 INFO mapred.JobClient: Counters: 18
13/04/16 11:40:42 INFO mapred.JobClient:   Job Counters 
13/04/16 11:40:42 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=563490
13/04/16 11:40:42 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/04/16 11:40:42 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/04/16 11:40:42 INFO mapred.JobClient:     Launched map tasks=1
13/04/16 11:40:42 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
13/04/16 11:40:42 INFO mapred.JobClient:   File Output Format Counters 
13/04/16 11:40:42 INFO mapred.JobClient:     Bytes Written=503848416
13/04/16 11:40:42 INFO mapred.JobClient:   FileSystemCounters
13/04/16 11:40:42 INFO mapred.JobClient:     HDFS_BYTES_READ=87
13/04/16 11:40:42 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=27953
13/04/16 11:40:42 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=503848416
13/04/16 11:40:42 INFO mapred.JobClient:   File Input Format Counters 
13/04/16 11:40:42 INFO mapred.JobClient:     Bytes Read=0
13/04/16 11:40:42 INFO mapred.JobClient:   Map-Reduce Framework
13/04/16 11:40:42 INFO mapred.JobClient:     Map input records=3133366
13/04/16 11:40:42 INFO mapred.JobClient:     Physical memory (bytes) snapshot=99123200
13/04/16 11:40:42 INFO mapred.JobClient:     Spilled Records=0
13/04/16 11:40:42 INFO mapred.JobClient:     CPU time spent (ms)=169400
13/04/16 11:40:42 INFO mapred.JobClient:     Total committed heap usage (bytes)=18546688
13/04/16 11:40:42 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1521811456
13/04/16 11:40:42 INFO mapred.JobClient:     Map output records=3133366
13/04/16 11:40:42 INFO mapred.JobClient:     SPLIT_RAW_BYTES=87
13/04/16 11:40:43 INFO mapreduce.ImportJobBase: Transferred 480.5073 MB in 577.2022 seconds (852.456 KB/sec)
13/04/16 11:40:43 INFO mapreduce.ImportJobBase: Retrieved 3133366 records.
13/04/16 11:40:43 INFO util.AppendUtils: Appending to directory kakouData


------------------------------------2CPU 1.8G内存-------------------------------------------
13/04/16 13:57:41 INFO mapreduce.ImportJobBase: Beginning import of C_PICRECORD_ALL
13/04/16 13:57:41 INFO manager.OracleManager: Time zone has been set to GMT
13/04/16 13:57:43 INFO mapred.JobClient: Running job: job_201304161356_0001
13/04/16 13:57:44 INFO mapred.JobClient:  map 0% reduce 0%
13/04/16 13:58:10 INFO mapred.JobClient:  map 100% reduce 0%
13/04/16 14:07:14 INFO mapred.JobClient: Job complete: job_201304161356_0001
13/04/16 14:07:14 INFO mapred.JobClient: Counters: 18
13/04/16 14:07:14 INFO mapred.JobClient:   Job Counters 
13/04/16 14:07:14 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=557391
13/04/16 14:07:14 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/04/16 14:07:14 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/04/16 14:07:14 INFO mapred.JobClient:     Launched map tasks=1
13/04/16 14:07:14 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
13/04/16 14:07:14 INFO mapred.JobClient:   File Output Format Counters 
13/04/16 14:07:14 INFO mapred.JobClient:     Bytes Written=503848416
13/04/16 14:07:14 INFO mapred.JobClient:   FileSystemCounters
13/04/16 14:07:14 INFO mapred.JobClient:     HDFS_BYTES_READ=87
13/04/16 14:07:14 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=27436
13/04/16 14:07:14 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=503848416
13/04/16 14:07:14 INFO mapred.JobClient:   File Input Format Counters 
13/04/16 14:07:14 INFO mapred.JobClient:     Bytes Read=0
13/04/16 14:07:14 INFO mapred.JobClient:   Map-Reduce Framework
13/04/16 14:07:14 INFO mapred.JobClient:     Map input records=3133366
13/04/16 14:07:14 INFO mapred.JobClient:     Physical memory (bytes) snapshot=139059200
13/04/16 14:07:14 INFO mapred.JobClient:     Spilled Records=0
13/04/16 14:07:14 INFO mapred.JobClient:     CPU time spent (ms)=178130
13/04/16 14:07:14 INFO mapred.JobClient:     Total committed heap usage (bytes)=17367040
13/04/16 14:07:14 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1521811456
13/04/16 14:07:14 INFO mapred.JobClient:     Map output records=3133366
13/04/16 14:07:14 INFO mapred.JobClient:     SPLIT_RAW_BYTES=87
13/04/16 14:07:14 INFO mapreduce.ImportJobBase: Transferred 480.5073 MB in 573.1275 seconds (858.5166 KB/sec)
13/04/16 14:07:14 INFO mapreduce.ImportJobBase: Retrieved 3133366 records.

上图我分配了不同的调度资源,不过速度上好像没提升,可能这跟map和reduce的数量有关,或者我的数据量不够大。从上面可知总共有3133366条记录,花费了10分钟不到,转移到hdfs上的大小是480多M。

 

5.更多sqoop使用效率分析。。。

敬请期待。。。

posted @ 2013-03-29 00:12  KaimingWan  阅读(1498)  评论(0编辑  收藏  举报