A summary of problems encountered while using nutch

1. Sometimes running nutch crawl urls never produces any results

The output looks like this:

[root@localhost apache-nutch-2.1]# nutch crawl urls

FetcherJob: threads: 10

FetcherJob: parsing: false

FetcherJob: resuming: false

FetcherJob : timelimit set for : -1

Using queue mode : byHost

Fetcher: threads: 10

QueueFeeder finished: total 1 records. Hit by time limit :0

Fetcher: throughput threshold: -1

Fetcher: throughput threshold sequence: 5

-finishing thread FetcherThread8, activeThreads=1

-finishing thread FetcherThread2, activeThreads=2

-finishing thread FetcherThread3, activeThreads=3

-finishing thread FetcherThread4, activeThreads=4

-finishing thread FetcherThread5, activeThreads=5

-finishing thread FetcherThread6, activeThreads=6

-finishing thread FetcherThread7, activeThreads=7

-finishing thread FetcherThread0, activeThreads=8

fetching http://www.csdn.com/

-finishing thread FetcherThread9, activeThreads=1

-finishing thread FetcherThread1, activeThreads=0

0/0 spinwaiting/active, 1 pages, 0 errors, 0.2 0.2 pages/s, 0 0 kb/s, 0 URLs in 0 queues

-activeThreads=0

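It keeps going like this and never fetches anything useful.

Trying a different url may be enough to fix it. Perhaps I had been experimenting on this site for too long and it had blocked my ip...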

2. Sometimes the nutch crawl urls command worked fine the last time, then suddenly stops working and fails with the following error:

[root@localhost urlsData]# nutch crawl urls/

Exception in thread "main" org.apache.gora.util.GoraException: org.apache.hadoop.hbase.MasterNotRunningException

at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)

at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)

at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:75)

at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:214)

at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)

at org.apache.nutch.crawl.Crawler.run(Crawler.java:136)

at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)

at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)

Caused by: org.apache.hadoop.hbase.MasterNotRunningException

at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getMaster(HConnectionManager.java:394)

at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:94)

at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:108)

at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)

at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)

... 8 more

Running jps to list the running processes shows that HMaster and HRegionServer have gone down on their own:

[root@localhost urlsData]# jps

14521 HQuorumPeer

14018 TaskTracker

15741 Crawler

13597 NameNode

16532 Jps

13907 JobTracker

13707 DataNode

15850 Crawler

16296 Main

13816 SecondaryNameNode
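Spotting two missing names in a jps listing by eye is easy to get wrong, so here is a minimal sketch that does the comparison. The EXPECTED set is my own assumption about which daemons a single-node Hadoop 1.x plus HBase setup should have, and the function names are made up for this illustration; adjust both to your own layout.

```python
# Daemons assumed to be required on a single-node Hadoop 1.x + HBase box.
# This set is an assumption for illustration, not something jps defines.
EXPECTED = {
    "NameNode", "DataNode", "SecondaryNameNode",
    "JobTracker", "TaskTracker",
    "HQuorumPeer", "HMaster", "HRegionServer",
}

# The jps listing from this post, with the blank lines removed.
SAMPLE_JPS = """\
14521 HQuorumPeer
14018 TaskTracker
15741 Crawler
13597 NameNode
16532 Jps
13907 JobTracker
13707 DataNode
15850 Crawler
16296 Main
13816 SecondaryNameNode
"""


def running_daemons(jps_output):
    """Collect the process names from 'PID Name' lines of jps output."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].isdigit():
            names.add(parts[1])
    return names


def missing_daemons(jps_output, expected=EXPECTED):
    """Return the expected daemons that do not appear in the jps output."""
    return sorted(set(expected) - running_daemons(jps_output))
```

Calling missing_daemons(SAMPLE_JPS) on the listing above reports HMaster and HRegionServer, which matches what I saw.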

The logs contain many exceptions like the following:

2012-12-15 17:42:40,428 FATAL org.apache.hadoop.hbase.master.HMaster: Shutting down HBase cluster: file system not available

org.apache.hadoop.io.MultipleIOException: 3 exceptions [org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /hbase/.META./1028785192/recovered.edits/0000000000000000005.temp could only be replicated to 0 nodes, instead of 1

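At this point it is worth looking at the hadoop logs as well. They contain exceptions similar to the one above, which means hadoop itself has run into trouble.

Next, stop hbase, restart hadoop, and run "./hadoop dfsadmin -safemode get". You will find that hadoop stays in safemode. Why is hadoop in safe mode? The hadoop logs are the place to look for clues.

Besides the errors already mentioned, you will also find the following messages: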
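If you want to poll the safemode state from a script instead of typing the command over and over, something like the sketch below should do. It assumes the hadoop launcher is reachable as "hadoop" (pass the path to ./hadoop otherwise) and that the command prints a line containing "Safe mode is ON" or "Safe mode is OFF"; the function names are my own.

```python
import subprocess


def parse_safemode(output):
    """Return True if the output says safe mode is ON, False if OFF,
    and None if neither phrase is found."""
    text = output.lower()
    if "safe mode is on" in text:
        return True
    if "safe mode is off" in text:
        return False
    return None


def query_safemode(hadoop_bin="hadoop"):
    """Run 'hadoop dfsadmin -safemode get' and parse what it prints.

    hadoop_bin is an assumed path to the launcher script.
    """
    result = subprocess.run(
        [hadoop_bin, "dfsadmin", "-safemode", "get"],
        capture_output=True,
        text=True,
    )
    return parse_safemode(result.stdout)
```

Keeping the parsing separate from the subprocess call means the parsing can be checked without a running cluster.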

2012-12-15 17:50:53,019 INFO org.apache.hadoop.hdfs.StateChange: STATE* Safe mode ON.

The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically.

After that comes the error we saw earlier:

The ratio of reported blocks 1.0000 has reached the threshold 0.9990. Safe mode will be turned off automatically in 26 seconds.

org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /hadoop/hdata/tmp/mapred/system. Name node is in safe mode.

The ratio of reported blocks 1.0000 has reached the threshold 0.9990. Safe mode will be turned off automatically in 26 seconds.

at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:1994)

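This raises the following questions:

1. Why does hadoop need to enter safemode as soon as it starts?

2. Safemode has a countdown, so why does hadoop move on to its next step and try to delete files on hdfs before the countdown has finished?

So I searched online for material on this part of hadoop startup and found a post that walks through the process: http://f.dataguru.cn/thread-34552-1-1.html

It explains that HDFS keeps redundant replicas of each block to improve reliability (the replica count is set through dfs.replication in hdfs-site.xml and is usually 3; with fewer than 3 datanodes, a block can have at most as many replicas as there are datanodes).

With replication in mind, look at hadoop's safe mode again: "once a (configurable) proportion of data blocks has been confirmed as 'safe', safe mode ends after a further delay". Here is the process in detail.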

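a. The namenode learns where blocks are stored on the datanodes from each datanode's blockreport.

b. Once the namenode has collected the blockreports, it works out how many usable replicas each block has. If a block has fewer usable replicas than dfs.replication specifies, a reference to that block is kept in the UnderReplicatedBlocks data structure.

c. Whether the namenode can leave safemode also depends on the parameter dfs.namenode.safemode.threshold-pct. It is the fraction of blocks in the cluster that must meet the minimal replication requirement (dfs.replication.min, not dfs.replication). A value greater than 1 keeps the namenode in safemode permanently; a value less than or equal to 0 means the namenode does not start in safemode. The value is normally set between 0 and 1 and defaults to 0.999f. The description in the default configuration reads: "Specifies the percentage of blocks that should satisfy the minimal replication requirement defined by dfs.replication.min. Values less than or equal to 0 mean not to start in safe mode."

d. When the check finds that the blocks reported by the datanodes meet the requirement, the namenode waits a little longer (the delay is controlled by dfs.safemode.extension and defaults to 30000 ms) and then safe mode ends. If the check does not pass, the system stays in safe mode.

Reference: http://wiki.apache.org/hadoop/FAQ#Does_the_name-node_stay_in_safe_mode_till_all_under-replicated_files_are_fully_replicated.3F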
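To make steps a to d concrete, here is a toy model of the decision as I understand it. It is a simplified sketch, not the actual namenode code: the function name, its parameters, and the way elapsed time is passed in are all made up for illustration, and only the two defaults (0.999 and 30000 ms) come from the text above.

```python
def safemode_state(safe_blocks, total_blocks, threshold_pct=0.999,
                   extension_ms=30000, ms_since_threshold_reached=0):
    """Toy model of the safemode exit rule described in steps a-d.

    Returns a (state, remaining_ms) tuple:
      ("off", 0)     safemode is not active
      ("on", None)   still waiting for enough blocks to be reported
      ("on", n)      threshold reached, n ms of the extension left
    """
    # A threshold <= 0 means the namenode does not start in safemode.
    if threshold_pct <= 0:
        return ("off", 0)

    # Ratio of blocks that meet the minimal replication requirement.
    # Treating an empty filesystem as ratio 1.0 is a simplification.
    ratio = 1.0 if total_blocks == 0 else safe_blocks / total_blocks

    # Below the threshold: stay in safemode. With a threshold above 1
    # this branch is always taken, so safemode never ends.
    if ratio < threshold_pct:
        return ("on", None)

    # Threshold reached: wait out the extension before leaving.
    remaining = extension_ms - ms_since_threshold_reached
    if remaining > 0:
        return ("on", remaining)
    return ("off", 0)
```

With 0 of 100 blocks reported the model stays in safemode, like the "ratio of reported blocks 0.0000" log line. With all blocks reported and 4 seconds elapsed it returns 26000 ms remaining, which lines up with the "turned off automatically in 26 seconds" message.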

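With that background, the cause of my error is not hard to understand.

Root cause analysis: I run hadoop in pseudo-distributed mode, so there is only one datanode and only one copy of the data. When some of that data gets corrupted (which happens easily; I found that running nutch from several shells in parallel readily corrupts writes) and the fraction of intact blocks falls below the threshold set by dfs.safemode.threshold.pct (dfs.safemode.extension only controls the delay afterwards), hadoop stays in safe mode and cannot leave it. That produces the error we keep seeing in the logs, "org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /hbase/.META./1028785192/recovered.edits/0000000000000000005.temp could only be replicated to 0 nodes, instead of 1", which in turn brings down HMaster and HRegionServer.

Fix: I have to keep running hadoop as a single-node pseudo-distributed setup (hardware constraints), so for now all I can do is set dfs.safemode.threshold.pct (that is its name in my version, 1.0.3) to 0, so that hadoop never enters safe mode. This can of course lead to other problems such as data loss, which I will ignore for now...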

After this change, however, the following exception still appears:

java.io.IOException: File /hadoop/hdata/tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1

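Lost data stays lost no matter how it is papered over, so it seems the only option is to reformat the namenode and datanode and restart hadoop.

It looks like the only way to avoid this is never to run nutch from several shells in parallel again...

My understanding here was still wrong; the failure does not seem to be caused by running in parallel... Every time nutch gets near the end it throws a lot of errors for reasons I don't know, and the next run then fails. I'll look again tomorrow...

Damn, today I finally found the problem: my disk had run out of space. I could cry... so much of my youth wasted for nothing...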
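Since a full disk was the real culprit, a quick free-space check before each crawl would have saved me days. A minimal sketch using Python's standard shutil.disk_usage follows; the path to check and the minimum you require are up to you, and the function names are made up for this example.

```python
import shutil


def free_gib(path):
    """Free space, in GiB, on the filesystem that contains path."""
    return shutil.disk_usage(path).free / 2**30


def has_enough_space(path, min_free_bytes):
    """True if the filesystem containing path has at least
    min_free_bytes available."""
    return shutil.disk_usage(path).free >= min_free_bytes
```

For example, has_enough_space("/hadoop/hdata", 5 * 2**30) would check for 5 GiB under the data directory that shows up in my logs; the 5 GiB figure is arbitrary.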

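3. When hadoop really is deployed in distributed mode and nutch runs on top of it, make sure of the following:

a. The firewall is turned off

b. All the datanodes have started

c. Check whether hadoop is in safemode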

posted on 2013-06-17 02:32 by 刺猬的温驯


A gentleman who studies widely and examines himself daily will be clear in understanding and faultless in conduct.
