Installing and Configuring Nutch

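Nutch is a web crawler implemented in Java. It can be installed from a binary package or built from source. This post covers installation from the binary package.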

1. Download apache-nutch-1.12-bin.tar.gz and extract it. Extraction creates an apache-nutch-1.12 folder.
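The extraction step can be sketched as a small shell helper. This is only an illustration: the function name unpack_nutch and the example paths are assumptions, not part of Nutch itself.

```shell
# Hypothetical helper: unpack a Nutch binary tarball into a target directory.
unpack_nutch() {
    local archive="$1" dest="$2"
    mkdir -p "$dest"                 # create the target directory if needed
    tar -xzf "$archive" -C "$dest"   # extract the gzip-compressed tarball there
}

# Example (paths are assumptions; adjust to where you saved the download):
# unpack_nutch ~/Downloads/apache-nutch-1.12-bin.tar.gz /opt
```

Extracting into /opt would match the /opt/apache-nutch-1.12/plugins path used in the configuration in step 2.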

2. Edit the conf/nutch-site.xml file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

    <property>
        <name>http.agent.name</name>
        <value>My Spider</value>
    </property>

    <property>
        <name>plugin.folders</name>
        <value>/opt/apache-nutch-1.12/plugins</value>
    </property>

</configuration>
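Because a misspelled property name is silently ignored, it can help to read the values back after editing. The following is a minimal sketch, assuming each <value> sits on the line directly after its <name> as in the file above. The helper name get_property is hypothetical, and this is plain text matching rather than real XML parsing.

```shell
# Hypothetical helper: print the <value> that follows a given <name>
# in a Hadoop-style XML config file. Assumes <value> is on the next line.
get_property() {
    local name="$1" file="$2"
    grep -F -A1 "<name>${name}</name>" "$file" \
        | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

# Example (run from the apache-nutch-1.12 folder):
# get_property http.agent.name conf/nutch-site.xml
# get_property plugin.folders conf/nutch-site.xml
```

An empty result means the property name was not found as written, which is a quick way to catch a typo in the name.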

3. Go into the apache-nutch-1.12 folder and run:

mkdir -p ./urls
cd urls
touch seed.txt

Edit seed.txt and add the site you want to crawl, one URL per line, for example:

http://xxxx.com/

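Edit conf/regex-urlfilter.txt and add a regular expression for your site under the final "accept anything else" comment, in place of the default +. rule there: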

# accept anything else
+^http://([a-z0-9]*\.)*xxxx\.com/

With this rule, Nutch crawls the pages of the http://xxxx.com/ site (including its subdomains) and skips URLs on other hosts.
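Before starting a crawl, the filter pattern can be tried against a few sample URLs. The sketch below uses grep -E as a stand-in for Nutch's Java regex engine, which is an assumption that holds for a simple pattern like this one. The function name url_allowed is hypothetical, and the dot before com is escaped so that it matches only a literal dot.

```shell
# Hypothetical helper: succeed if a URL matches the accept pattern.
# grep -E stands in for Nutch's Java regex matching here.
url_allowed() {
    printf '%s\n' "$1" | grep -Eq '^http://([a-z0-9]*\.)*xxxx\.com/'
}

# Example:
# url_allowed 'http://www.xxxx.com/index.html' && echo accepted
# url_allowed 'http://example.org/' || echo rejected
```

Note that the pattern only covers http:// URLs, so https:// pages on the same site would be rejected unless the scheme is widened to https?://.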

4. Back in the apache-nutch-1.12 folder, create a crawls directory and run:

bin/crawl urls/seed.txt crawls 10

This starts the crawl. Here crawls is the directory where the crawl data is stored, and 10 is the number of rounds.
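A crawl can fail early if JAVA_HOME is not set or the seed file is empty, so a small check before launching can save a round trip. This is a sketch only: the function name preflight is hypothetical, and it treats lines starting with # as comments.

```shell
# Hypothetical pre-crawl check: JAVA_HOME must be set and the seed file
# must contain at least one line that is neither blank nor a # comment.
preflight() {
    local seed="$1"
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    if ! grep -Eqv '^[[:space:]]*(#|$)' "$seed" 2>/dev/null; then
        echo "no seed URLs found in $seed" >&2
        return 1
    fi
}

# Example (run from the apache-nutch-1.12 folder):
# preflight urls/seed.txt && bin/crawl urls/seed.txt crawls 10
```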

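5. When the crawl finishes, three folders are created under the crawls directory: crawldb, linkdb, and segments. Use the following command to export a segment's binary data as text (substitute the timestamped segment name from your own crawl):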

bin/nutch readseg -dump ./crawls/segments/20170328163131 ./crawls/segments/2017032816313_dump

You can then open the text file in gedit to view the crawl results.
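Segment folders are named by timestamp, so the name differs on every run. The sketch below picks the newest one instead of typing it by hand. The function name latest_segment is hypothetical. It assumes segment names consist of digits only and skips anything else, such as a dump folder written into segments.

```shell
# Hypothetical helper: print the newest segment directory under a crawl dir.
# Glob results are sorted, so the last all-digit name is the latest timestamp.
latest_segment() {
    local crawl_dir="$1" newest="" dir
    for dir in "$crawl_dir"/segments/*; do
        case "${dir##*/}" in
            *[!0-9]*) continue ;;    # skip names that are not pure digits
        esac
        if [ -d "$dir" ]; then
            newest="$dir"
        fi
    done
    if [ -n "$newest" ]; then
        printf '%s\n' "$newest"
    fi
}

# Example (run from the apache-nutch-1.12 folder; output path is an assumption):
# seg=$(latest_segment ./crawls)
# bin/nutch readseg -dump "$seg" "./crawls/dump_${seg##*/}"
```

Writing the dump outside the segments folder, as in the example, keeps that folder limited to real segments.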

posted @ 2017-03-28 18:36 MSTK
