Importing Data from MySQL into Hive with Sqoop

Preface

This post summarizes the pitfalls I ran into while importing data from MySQL into Hive with Sqoop.

Environment:

OS: CentOS 6.5
Hadoop: Apache, 2.7.3
MySQL: 5.1.73
JDK: 1.8
Sqoop: 1.4.7

Hadoop runs in pseudo-distributed mode.

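I. The import command

My tests mainly followed the article "Sqoop: Import Data From MySQL to Hive".

Following its approach, I created a table in MySQL, populated it with data, and then ran the command adapted to my own setup: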

sqoop import --connect jdbc:mysql://localhost:3306/test --username root -P --split-by id --columns id,name --table customer  --target-dir /user/cloudera/ingest/raw/customers --fields-terminated-by "," --hive-import --create-hive-table --hive-table sqoop_workspace.customers

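Then the game of whack-a-mole began.

II. Problems encountered and how I solved them

1. Splitting on a text column

Error message: "Generating splits for a textual index column allowed only in case of "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" property passed as a parameter".

The cause is that the id column named by "--split-by id" is a text column, so the option "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" has to be added to the command. As a generic Hadoop option it goes right after "sqoop import", before the Sqoop-specific arguments. The full command becomes: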

sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" --connect jdbc:mysql://localhost:3306/test --username root -P --split-by id --columns id,name --table customer  --target-dir hdfs://harry.com:9000/user/cloudera/ingest/raw/customers --fields-terminated-by "," --hive-import --create-hive-table --hive-table sqoop_workspace.customers
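The one-line command above is hard to read, and the position of the -D option is easy to get wrong. A minimal sketch, assuming a POSIX shell: the helper name build_import_cmd is hypothetical, and it only prints the command (one logical option per source line) so it can be inspected before being run by hand.

```shell
# Hypothetical helper: assemble the import command from the post and print it.
# Nothing is executed here; the function only echoes the final command line.
build_import_cmd() {
    # Generic Hadoop options (-D...) come directly after the tool name "import",
    # ahead of Sqoop-specific arguments such as --connect.
    set -- sqoop import \
        -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
        --connect jdbc:mysql://localhost:3306/test \
        --username root -P \
        --split-by id \
        --columns id,name \
        --table customer \
        --target-dir hdfs://harry.com:9000/user/cloudera/ingest/raw/customers \
        --fields-terminated-by , \
        --hive-import --create-hive-table \
        --hive-table sqoop_workspace.customers
    printf '%s\n' "$*"
}

build_import_cmd
```

Printing first makes it easy to confirm that the third word of the command is the -D option before pasting it into a terminal.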

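2. The Hadoop JobHistory server was not running

Error message: "ERROR tool.ImportTool: Import failed: java.io.IOException: java.net.ConnectException: Call From harry.com/192.168.0.210 to 0.0.0.0:10020 failed on connection exception: …".

The cause is that once the MapReduce job has finished, Sqoop relies on the Hadoop JobHistory server, which records information about completed jobs and stores it in the configured HDFS directories. The address in the error, 0.0.0.0:10020, is the JobHistory server's address. This server is not started by default, so it has to be configured and then started by hand.

Fix: add the following configuration to mapred-site.xml: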

<property>
    <name>mapreduce.jobhistory.address</name>
    <value>0.0.0.0:10020</value>
</property>

<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>0.0.0.0:19888</value>
</property>

<property>
    <name>mapreduce.jobhistory.done-dir</name>
    <value>/history/done</value>
</property>

<property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>/history/done_intermediate</value>
</property>

Restart HDFS and YARN:

stop-dfs.sh
stop-yarn.sh
start-dfs.sh
start-yarn.sh

Start the history server:

$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver

If you need to stop it later, use:

$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh stop historyserver

Then rerun the import command.
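Before rerunning the import, it can help to confirm that the history server actually came up. A small sketch, assuming the JDK's jps tool is on the PATH and that the server appears in its listing as JobHistoryServer; the function name history_server_running is hypothetical.

```shell
# Hypothetical helper: succeed if a jps-style listing (passed as one string)
# contains a JobHistoryServer process.
history_server_running() {
    printf '%s\n' "$1" | grep -q 'JobHistoryServer'
}

# Only query the JVM list when jps is available on this machine.
if command -v jps >/dev/null 2>&1; then
    if history_server_running "$(jps)"; then
        echo "JobHistoryServer is running"
    else
        echo "JobHistoryServer is not running"
    fi
fi
```

If it reports that the server is not running, the daemon's log under the Hadoop logs directory is the next place to look.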

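3. Error connecting to the metastore database

Error message: "Caused by: javax.jdo.JDOFatalDataStoreException: Unable to open a test connection to the given database. JDBC url…".

This happened because my Hive metastore used the default Derby database and I had a Hive CLI open in another session. A drawback of Derby here is that it fails as soon as more than one user accesses Hive at the same time.

Fix: quit the Hive CLI and run the import again.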
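One way to spot a forgotten Hive session is to look for Derby's lock files. A hedged sketch: it assumes the default embedded metastore directory ./metastore_db (created in the directory Hive was launched from) and that Derby keeps db.lck or dbex.lck there while the database is open. The function name derby_in_use is hypothetical, and a stale lock file left behind by a crash would give a false positive.

```shell
# Hypothetical helper: succeed if the given Derby database directory
# contains one of Derby's lock files.
derby_in_use() {
    [ -e "$1/db.lck" ] || [ -e "$1/dbex.lck" ]
}

# Assumption: the metastore lives in ./metastore_db (Hive's embedded default).
if derby_in_use ./metastore_db; then
    echo "metastore_db looks locked; close the other Hive session first"
else
    echo "no Derby lock file found in ./metastore_db"
fi
```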

4. The database had not been created in Hive

Error message: "ERROR ql.Driver: FAILED: SemanticException [Error 10072]: Database does not exist: sqoop_workspace…". This one is self-explanatory: just create the database in Hive.
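Creating the database can be done from the shell without opening an interactive session. A minimal sketch, assuming the hive command is on the PATH; the helper name create_db_stmt is hypothetical, and the statement is only printed when hive is not installed.

```shell
# Hypothetical helper: build a HiveQL statement that creates the database
# only if it does not exist yet.
create_db_stmt() {
    printf 'CREATE DATABASE IF NOT EXISTS %s;\n' "$1"
}

stmt=$(create_db_stmt sqoop_workspace)

# Run the statement through the Hive CLI when available; otherwise print it.
if command -v hive >/dev/null 2>&1; then
    hive -e "$stmt"
else
    echo "$stmt"
fi
```

Using IF NOT EXISTS keeps the step safe to repeat when the import is retried.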

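5. Other warnings and errors

The remaining messages did not actually block the import. Take this WARN, for example:

"WARN hdfs.DFSClient: Caught exception java.lang.InterruptedException…". This is a bug in Hadoop itself, specifically HDFS-9794:

When a DFSStripedOutputStream is closed, the streamer threads are not shut down if flushing data back to the data/parity blocks fails. The same problem exists in DFSOutputStream#closeImpl, which always force-closes the threads and thereby triggers an InterruptedException.

Messages like these can generally be ignored.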

III. Some additional notes

1. Why use --split-by

A discussion on Stack Overflow explains this clearly:

--split-by : It is used to specify the column of the table used to generate splits for imports. This means that it specifies which column will be used to create the split while importing the data into your cluster. It can be used to enhance the import performance by achieving greater parallelism. Sqoop creates splits based on values in a particular column of the table which is specified by --split-by by the user through the import command. If it is not available, the primary key of the input table is used to create the splits.

--split-by is used to distribute the values from table across the mappers uniformly i.e. say u have 100 unique records(primary key) and if there are 4 mappers, --split-by (primary key column) will help to distribute you data-set evenly among the mappers.

In short, the --split-by parameter exists so that the data can be distributed more evenly across the map tasks when they run.
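The example in the quote (100 unique keys, 4 mappers) can be worked through in a few lines. This is a simplified illustration of range splitting on a numeric column, not Sqoop's actual splitter code. The function name split_ranges and the printed condition format are assumptions made for the sketch.

```shell
# Hypothetical helper: divide the key range [min, max] into n contiguous,
# near-equal slices and print the condition each mapper would handle.
split_ranges() {
    min=$1; max=$2; n=$3
    count=$((max - min + 1))          # number of key values in the range
    i=0
    while [ "$i" -lt "$n" ]; do
        lo=$((min + i * count / n))             # first key of slice i
        hi=$((min + (i + 1) * count / n - 1))   # last key of slice i
        echo "mapper $i: id >= $lo AND id <= $hi"
        i=$((i + 1))
    done
}

split_ranges 1 100 4
# mapper 0: id >= 1 AND id <= 25
# mapper 1: id >= 26 AND id <= 50
# mapper 2: id >= 51 AND id <= 75
# mapper 3: id >= 76 AND id <= 100
```

The slices are even only when the key values are spread evenly between the minimum and maximum. A column with large gaps or heavy skew would leave some mappers with far more rows than others.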