1. Installing Cygwin
I used the local-install version of Cygwin (local install) and simply set every component to installed.
2. Unpacking Nutch
Unpack Nutch 0.9 and copy it to home/Administrator under the Cygwin directory, or unpack it from inside Cygwin with gunzip and tar; either way works.
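3. Installing the JDK
Perhaps my system has been misbehaving lately, but Nutch could only find my JDK when it was installed inside the nutch directory. The environment variables were set correctly, yet with the JDK installed anywhere else it was not found. If anyone knows the reason, please tell me, many thanks! My installation path is C:\cygwin\home\Administrator\nutch-0.9\JDK
The environment variables are set as follows:
JAVA_HOME C:\cygwin\home\Administrator\nutch-0.9\JDK
CLASSPATH append ;C:\cygwin\home\Administrator\nutch-0.9\JDK\lib
NUTCH_JAVA_HOME C:\cygwin\home\Administrator\nutch-0.9\JDK
PATH append ;C:\cygwin\home\Administrator\nutch-0.9\JDK\bin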
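4. Preparing to crawl
First create a directory named urls under the bin directory, and inside it create a text file named nutch.txt. Enter the address of the site you want to crawl, for example http://www.sina.com.cn/ (the trailing / is required).
Open conf\crawl-urlfilter.txt and change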
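A forgotten scheme or trailing slash is an easy mistake in the seed file, so here is a minimal Python sketch that normalizes seed URLs before writing them. It is not part of Nutch; the helper names `normalize_seed` and `write_seed_file` are my own, and the sketch assumes one URL per line in urls/nutch.txt as described above.

```python
import os
from urllib.parse import urlsplit, urlunsplit


def normalize_seed(url):
    """Return the URL with an explicit http:// scheme and a non-empty path."""
    url = url.strip()
    if "://" not in url:
        url = "http://" + url  # a seed without a scheme gets http://
    parts = urlsplit(url)
    path = parts.path or "/"  # "http://host" becomes "http://host/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def write_seed_file(path, urls):
    """Write one normalized URL per line, creating the folder if needed."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for url in urls:
            handle.write(normalize_seed(url) + "\n")
```

Calling `write_seed_file("urls/nutch.txt", ["www.sina.com.cn"])` from the bin directory would then produce a file containing http://www.sina.com.cn/.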
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
to
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*sina.com.cn/
(The trailing / is required here as well.)
Open nutch/conf/nutch-site.xml and replace the empty <configuration></configuration> element with:
<configuration>
<property>
<name>http.agent.name</name>
<value>HD nutch agent</value>
</property>
<property>
<name>http.agent.version</name>
<value>1.0</value>
</property>
</configuration>
Save the file.
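To confirm that the edited file really carries an agent name, the properties can be read back with a few lines of Python. This is only a sketch using the standard library XML parser; `read_properties` and `SAMPLE` are names I made up, and `SAMPLE` simply repeats the configuration shown above.

```python
import xml.etree.ElementTree as ET

# Same content as the nutch-site.xml snippet in this post.
SAMPLE = """
<configuration>
<property>
<name>http.agent.name</name>
<value>HD nutch agent</value>
</property>
<property>
<name>http.agent.version</name>
<value>1.0</value>
</property>
</configuration>
"""


def read_properties(xml_text):
    """Map each <property> name to its value ('' when the value is missing)."""
    root = ET.fromstring(xml_text.strip())
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value", default="") or "").strip()
    return props


props = read_properties(SAMPLE)
if not props.get("http.agent.name"):
    print("http.agent.name is empty")
```

For a real file, `ET.parse(path).getroot()` could be used in place of `ET.fromstring`.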
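5. Starting the crawl
Go to the nutch directory, then into its bin directory, and run:
$ sh nutch crawl urls -dir sina -depth 4 -threads 5 -topN 1000 >&logs/log1.log
crawl: tells nutch.jar to run the main method of Crawl.
urls: the directory that holds the text file of seed URLs (nutch.txt above).
-dir sina: where the crawl output is stored.
-depth 4: the number of crawl rounds, also called the depth, although "rounds" seems the more fitting word to me. For a test run I suggest setting it to 1.
-threads 5: the number of concurrent fetcher threads, 5 here.
-topN 1000: the maximum number of pages fetched in each round.
** Note: the sina directory must not exist beforehand, otherwise the crawl fails with an error.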
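The options above can be assembled and checked in a small wrapper before the crawl is launched. The sketch below is an illustration only: `build_crawl_command` and `max_pages_fetched` are hypothetical helpers of mine, and the page estimate assumes that each round fetches at most topN pages, so depth times topN is an upper bound rather than a prediction.

```python
import os


def build_crawl_command(seed_dir, out_dir, depth, threads, top_n):
    """Return the argument list for the crawl, refusing an existing output dir."""
    if os.path.exists(out_dir):
        # The crawl reports an error when the output directory already exists.
        raise FileExistsError("output directory already exists: " + out_dir)
    return ["sh", "nutch", "crawl", seed_dir,
            "-dir", out_dir,
            "-depth", str(depth),
            "-threads", str(threads),
            "-topN", str(top_n)]


def max_pages_fetched(depth, top_n):
    """Upper bound on fetched pages: at most top_n pages in each of depth rounds."""
    return depth * top_n
```

With the values used in this post, `max_pages_fetched(4, 1000)` gives 4000. The returned list could be handed to `subprocess.run` with `cwd` set to the bin directory, redirecting the output to a log file.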
Errors I ran into:
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
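Fix: adding one more URL to nutch.txt in the urls directory was enough. I still do not know the cause, although the message itself suggests checking that the seed URLs match the patterns in crawl-urlfilter.txt.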
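Since the message points at the URL filters, it can help to test a seed URL against the filter rules outside Nutch. The Python sketch below imitates the filter as I understand it: rules are tried in order, the first pattern found in the URL decides (+ accepts, - rejects), and a URL that matches nothing is rejected. `load_rules`, `accepts`, and the shortened `SAMPLE_RULES` are my own illustration, not the real filter file, and Python's `re` is assumed to behave like Java's regex engine for patterns this simple.

```python
import re

# Shortened stand-in for crawl-urlfilter.txt: the rule from this post plus
# a final "-." line that rejects everything else.
SAMPLE_RULES = r"""
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*sina.com.cn/
-.
"""


def load_rules(text):
    """Parse lines starting with + or - into (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        if not line or line[0] not in "+-":
            continue  # comments, blank lines and indented lines are not rules here
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """Apply the first rule whose pattern is found in the URL; reject if none."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = load_rules(SAMPLE_RULES)
for seed in ("http://www.sina.com.cn/", "http://www.sina.com.cn"):
    print(seed, accepts(rules, seed))
```

In this sketch the seed without the trailing / is rejected, because the accept pattern ends in / and the final -. rule then matches.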
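6. Deploying to the server
Once the crawl has succeeded, configure Tomcat.
Tomcat is installed in: C:\Program Files\Apache Software Foundation\Tomcat 5.5
TOMCAT_HOME C:\Program Files\Apache Software Foundation\Tomcat 5.5
CLASSPATH append: %TOMCAT_HOME%\bin;
With the server stopped, delete the ROOT folder inside Tomcat's webapps folder, copy nutch-0.9.war into webapps, and rename it to ROOT.war. When Tomcat starts, it unpacks the archive into a new ROOT folder.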
Edit /webapps/ROOT/WEB-INF/classes/nutch-site.xml:
Replace
<configuration>
</configuration>
with
<configuration>
<property>
<name>searcher.dir</name>
<value>C:\cygwin\home\Administrator\nutch-0.9\bin\sina</value>
</property>
</configuration>
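If the search page later returns no hits, it is worth checking that searcher.dir points at a finished crawl. The sketch below assumes, based on what a Nutch 0.9 crawl left on my disk, that the crawl folder contains the sub-directories crawldb, linkdb, segments and index. The `EXPECTED` tuple and the function name `missing_crawl_parts` are my own assumptions, not something the search application checks itself.

```python
import os

# Sub-directories a finished Nutch 0.9 crawl is assumed to contain.
EXPECTED = ("crawldb", "linkdb", "segments", "index")


def missing_crawl_parts(searcher_dir):
    """Return the expected sub-directories that are absent from searcher_dir."""
    return [name for name in EXPECTED
            if not os.path.isdir(os.path.join(searcher_dir, name))]
```

An empty list from `missing_crawl_parts` on the folder configured as searcher.dir means all four sub-directories were found.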
To support Chinese queries, Tomcat's configuration has to be changed. Open server.xml under tomcat\conf and change the Connector element to the following:
<Connector port="8080" maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
enableLookups="false"
redirectPort="8443"
acceptCount="100"
connectionTimeout="20000"
disableUploadTimeout="true"
URIEncoding="UTF-8" useBodyEncodingForURI="true" />
Note that the two attributes on the last line are the new additions.
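Why URIEncoding matters can be shown without Tomcat. Tomcat 5.5 decodes request URIs as ISO-8859-1 unless told otherwise, while a Chinese query typically arrives as percent-encoded UTF-8 bytes (an assumption about the browser and page encoding). This small Python sketch decodes the same query string both ways.

```python
from urllib.parse import quote, unquote

query = "中文"

# Percent-encode the UTF-8 bytes of the query, as it would appear in the URL.
encoded = quote(query)

# Decoding with the right charset restores the query ...
as_utf8 = unquote(encoded, encoding="utf-8")

# ... while decoding byte-by-byte as ISO-8859-1 yields six unrelated characters.
as_latin1 = unquote(encoded, encoding="iso-8859-1")

print(encoded)
print(as_utf8 == query, as_latin1 == query)
```

The encoded form is %E4%B8%AD%E6%96%87, and only the UTF-8 decoding gives back the original two characters.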
Searching through Tomcat
Restart Tomcat and open http://127.0.0.1:8080 in a browser.
The Nutch search page appears.
Type java into the search box and run the search to see your results.
*** Errors I ran into
org.apache.jasper.JasperException: /search.jsp(151,22) Attribute value language + "/include/header.html" is quoted with " which must be escaped when used within the value
This error puzzled me for a long time. I eventually found the answer at http://news.skelter.net/articles/2008/09/24/nutch-0-9-quoted-with-must-be-escaped
The fix is to change line 151 of search.jsp to
<jsp:include page="<%= language + \"/include/header.html\"%>"/>
That solved the problem.
Test environment
- Nutch release 0.9
- Eclipse 3.3 - aka Europa
- Java 1.6
Before you start
Setting up Nutch to run into Eclipse can be tricky, and most of the time you are much faster if you edit Nutch in Eclipse but run the scripts from the command line (my 2 cents). However, it's very useful to be able to debug Nutch in Eclipse. But again you might be quicker by looking at the logs (logs/hadoop.log)...
Setup steps
Installing Nutch
- Grab a fresh release of Nutch 0.9 from http://lucene.apache.org/nutch/version_control.html
- Do not build Nutch now. Make sure you have no .project and .classpath files in the Nutch directory
Creating a new Java project named nutch in Eclipse
- File > New > Project > Java project > click Next
- Name the project (nutch)
- Select "Create project from existing source" and use the location where you downloaded Nutch
- Click on Next, and wait while Eclipse is scanning the folders
- Eclipse should have guessed all the Java files that must be added to your classpath. If that is not the case, add "src/java", "src/test" and all plugin "src/java" and "src/test" folders to your source folders. Also add all jars in "lib" and in the plugin lib folders to your libraries
- Add the conf folder of nutch-0.9 to the project; it holds the configuration files. Click the conf folder, choose the third option to add the conf folder to the build path, and set the Default output folder to /nutch/conf.
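Configuring Nutch
1. For convenience, create a file named url.txt directly under the nutch project and add the sites to crawl to it, for example http://www.sina.com.cn/. The trailing "/" is required, and so is the leading "http://".
2. Configure crawl-urlfilter.txt: open conf/crawl-urlfilter.txt in the project and find these two lines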
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
The second line is a regular expression. Rewrite it as follows:
+^http://([a-z0-9]*\.)*com.cn/
+^http://([a-z0-9]*\.)*cn/
+^http://([a-z0-9]*\.)*com/
Note: there must be no space before the "+" sign.
3. Change conf\nutch-site.xml to the following; otherwise nothing is fetched.
<configuration>
<property>
<name>http.agent.name</name>
<value>*</value>
</property>
</configuration>
4. In conf/nutch-default.xml, change the value of the "plugin.folders" property from "plugins" to "./src/plugin".
Fixing the missing org.farng and com.etranslate packages
You will encounter problems with some imports in the parse-mp3 and parse-rtf plugins (30 errors in my case). Because of incompatibility with the Apache license, the required jars were left out of the sources. You can download them here:
http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/
http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/
Copy the jar files into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively. Then add them to the libraries on the build path (first refresh the workspace, then right-click the source folder => Java Build Path => Libraries => Add Jars).
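Setting up a run configuration for Crawl.java
- Menu Run > "Run..."
- Create "New" for "Java Application"
- Set the Main class to
org.apache.nutch.crawl.Crawl
- On the Arguments tab, under Program Arguments, enter
urls -dir crawl -depth 3 -topN 50
(urls is the folder that holds the seed addresses: create a urls directory in the project root and, inside it, a text file, with or without an extension, containing the URLs, for example http://www.163.com/.
-dir crawl creates a folder named crawl, which is where the fetched data is stored.
-depth 3 sets the crawl depth to 3 levels; -topN is the maximum number of pages fetched per level.)
- Under VM arguments, enter
-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
- Click on "Run"
- If all works, you should see Nutch getting busy at crawling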

